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Abstract 


This thesis studies two problems in modern statistics. First, we study selective inference, or 
inference for hypothesis that are chosen after looking at the data. The motiving application 
is inference for regression coefficients selected by the lasso. We present the Condition-on- 
Selection method that allows for valid selective inference, and study its application to the 
lasso, and several other selection algorithms. 

In the second part, we consider the problem of learning the structure of a pairwise 
graphical model over continuous and discrete variables. We present a new pairwise model 
for graphical models with both continuous and discrete variables that is amenable to struc¬ 
ture learning. In previous work, authors have considered structure learning of Gaussian 
graphical models and structure learning of discrete models. Our approach is a natural gen¬ 
eralization of these two lines of work to the mixed case. The penalization scheme involves 
a novel symmetric use of the group-lasso norm and follows naturally from a particular 
parametrization of the model. We provide conditions under which our estimator is model 
selection consistent in the high-dimensional regime. 
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Chapter 1 

Introduction 


This thesis is split into two parts: selective inference and learning mixed graphical models. 

The contributions are summarized below: 

• Selective Inference: 

— Chapter 2: This chapter studies selective inference for the lasso-selected model. 
We show how to construct conhdence intervals for regression coefficients corre¬ 
sponding to variables selected by the lasso, and how to test the significance of 
a lasso-selected model by conditioning on the selection event of the lasso. The 
results of this chapter appear in Lee et al. (2013a) and is joint work with Dennis 
Sun, Yuekai Sun, and Jonathan Taylor. 

— Chapter 3: This chapter shows how the Condition-on-Selection method devel¬ 
oped in Chapter 2 is not specihc to the lasso. In Chapter 3.1, we show that 
controlling the conditional type 1 error implies control of the selective type 1 
error, which motivates the use of the Condition-on-Selection method to control 
conditional type 1 error. Chapter 3.2 studies several other variable selection 
methods including marginal screening, orthogonal matching pursuit, and non¬ 
negative least squares with affine selection events, so we can apply the results 
of Chapter 2. Motivated by more complicated selection algorithms that do not 
simple selection events,such as the knockoff hlter, SCAD/MCP regularizers, and 
£i-logistic regression, we develop a general algorithm that only requires a black¬ 
box evaluation of the selection algorithm in Chapter 3.3. Finally in Chapter 3.4 
we study inference for the full model regression coefficients. We show a method 
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for FDR control, and the asymptotic coverage of selective confidence intervals in 
the high-dimensional regime. This chapter is joint work with Jonathan Taylor 
and will appear in a future publication. 

• Learning Mixed Graphical Models: 

— We propose a new pairwise Markov random field that generalizes the Gaussian 
graphical model to include categorical variables. 

— We design a new regularizer that promotes edge sparsity in the mixed graphical 
model. 

— Three methods for parameter estimation are proposed: pseudoliklihood, node¬ 
wise regression, and maximum likelihood. 

— The resulting optimization problem is solved using the proximal Newton method 
Lee et al. (2012). 

— We use the framework of Lee et al. (2013b) to establish edge selection consistency 
results for the MLE and pseudolikelihood estimation methods. 

— The results of this chapter originally appeared in Lee and Hastie (2014) and is 
joint work with Trevor Hastie. 


Part I 


Selective Inference 



Chapter 2 

Selective Inference for the Lasso 

2.1 Introduction 

As a statistical technique, linear regression is both simple and powerful. Not only does it 
provide estimates of the “effect” of each variable, but it also quantifies the uncertainty in 
those estimates, paving the way for intervals and tests of the effect size. However, in many 
applications, a practitioner starts with a large pool of candidate variables, such as genes 
or demographic features, and does not know a priori which are relevant. The problem is 
especially acute if there are more variables than observations, when it is impossible to even 
fit linear regression. 

A practitioner might wish to use the data to select the relevant variables and then make 
inference on the selected variables. As an example, one might fit a linear model, observe 
which coefficients are significant at level a, and report (1 — Q;)-confidence intervals for only 
the significant coefficients. However, these intervals fail to take into account the randomness 
in the selection procedure. In particular, the intervals do not have the stated coverage once 
one marginalizes over the selected model. 

To see this formally, assume the usual linear model 

y = M + e, /i = A/ 30 , e ~ iV(0, a^I), (2.1.1) 

where X G is the design matrix and G R^. Let M C {1, ...,p} denote a (random) 
set of selected variables. Suppose the goal is inference about /3j. Then, we do not even 
form intervals for f3j when j ^ M, so the first issue is to define an interval when j ^ M 
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in order to evaluate the coverage of this procedure. There is no obvious way to do this so 
that the marginal coverage is 1 — a. Furthermore, as M varies, the target of the ordinary 
least-squares (OLS) estimator (3^^ is not /3°, but rather 



where denotes the Moore-Penrose pseudoinverse of We see that X^/3^ = 
the projection of jjL onto the columns of so represents the coefficients in the best 
linear model using only the variables in M. In general, ^ 7^ Pj unless M contains the 
support set of /3°, i.e., M D S := {j : /3j ^ 0}. Since may not be estimating /3j 

at all, there is no reason to expect a confidence interval based on it to cover Pj. Berk 
et al. (2013) provide an explicit example of the non-normality of m the post-selection 
context. In short, inference in the linear model has traditionally been incompatible with 
model selection. 

2.1.1 The Lasso 

In this paper, we focus on a particular model selection procedure, the lasso (Tibshirani, 
1996), which achieves model selection by setting coefficients to zero exactly. This is accom¬ 
plished by adding an penalty term to the usual least-squares objective: 

/3 e argmin h\y - X/3\\l + A||/3||i, (2-1.2) 

,9eiRp ^ 

where A > 0 is a penalty parameter that controls the tradeoff between fit to the data and 
sparsity of the coefficients. However, the distribution of the lasso estimator /3 is known only 
in the less interesting p case (Knight and Fu, 2000), and even then, only asymptotically. 
Inference based on the lasso estimator is still an open question. 

We apply our framework for post-selection inference about to form confidence 

intervals for ■ and to test whether the the fitted model captures all relevant signal 
variables. 

2.1.2 Related Work 

Most of the theoretical work on fitting high-dimensional linear models focuses on consis¬ 
tency. The flavor of these results is that under certain assumptions on X, the lasso fit fl 
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is close to the unknown /3^ (Negahban et ah, 2012) and selects the correct model (Zhao 
and Yu, 2006; Wainwright, 2009). A comprehensive survey of the literature can be found 
in Biihlmann and van de Geer (2011). 

There is also some recent work on obtaining confidence intervals and significance testing 
for penalized M-estimators such as the lasso. One class of methods uses sample splitting 
or subsampling to obtain confidence intervals and p-values. Recently, Meinshausen and 
Biihlmann (2010) proposed stability selection as a general technique designed to improve 
the performance of a variable selection algorithm. The basic idea is, instead of performing 
variable selection on the whole data set, to perform variable selection on random subsamples 
of the data of size ^ and include the variables that are selected most often on the subsamples. 

A separate line of work establishes the asymptotic normality of a corrected estimator 
obtained by “inverting” the KKT conditions (van de Geer et ah, 2013; Zhang and Zhang, 
2014; Javanmard and Montanari, 2013). The corrected estimator b usually has the form 

6 = /3 + A0Z, 

where z is a subgradient of the penalty at /3 and 0 is an approximate inverse to the Gram 
matrix A^A. This approach is very general and easily handles M-estimators that minimize 
the sum of a smooth convex loss and a convex penalty. The two main drawbacks to this 
approach are: 

1. the conhdence intervals are valid only when the M-estimator is consistent 

2. obtaining 0 is usually much more expensive than obtaining /3. 

Most closely related to our work is the pathwise signficance testing framework laid out 
in Lockhart et al. (2014). They establish a test for whether a newly added coefficient is 
a relevant variable. This method only allows for testing at A that are LARS knot values. 
This is a considerable restriction, since the lasso is often not solved with the LARS algo¬ 
rithm. Furthermore, the test is asymptotic, makes strong assumptions on A, and the weak 
convergence assumes that all relevant variables are already included in the model. They do 
not discuss forming conhdence intervals for the selected variables. Section 2.5.2 establishes 
a nonasymptotic test for the same null hypothesis, while only assuming A is in general 
position. 

In contrast, we provide a test that is exact, allows for arbitrary A, and arbitrary design 
matrix A. By extension, we do not make any assumptions on n and p, and do not require 
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the lasso to be a consistent estimator of (3^. Furthermore, the compntational expense to 
conduct our test is negligible compared to the cost of obtaining the lasso solution. 

Like all of the preceding works, our test assumes that the noise variance is known 
or can be estimated. In the low-dimensional setting p n, can be estimated from the 
residual sum-of-squares of the saturated model. Strategies in high dimensions are discussed 
in Fan et ah (2012) and Reid et al. (2013). In Section 2.8, we also provide a strategy for 
estimating based on the framework we develop. 

2.1.3 Outline of Chapter 

We begin by defining several important quantities related to the lasso in Section 2.2; most 
notably, we dehne the selected model M in terms of the active set of the lasso solntion. 
Section 2.3 provides an alternative characterization of the selection procedure for the lasso 
in terms of affine constraints on y, i.e.. Ay < b. Therefore, the distribution of y conditional 
on the selected model is the distribution of a Gaussian vector conditional on its being in 
a polytope. In Section 2.4, we generalize and show that for y ~ N{p, S), the distribution 
of rj^y I Ay < 6 is roughly a truncated Gaussian random variable, and derive a pivot for 
rj^p. In Section 2.5, we specialize again to the lasso, deriving conhdence intervals for ^ 
and hypothesis tests of the selected model as special cases of Section 2.6 presents an 
example of these methods applied to a dataset. 

In Section 2.7, we consider a refinement that produces narrower conhdence intervals. 
Finally, Section 2.8 collects a number extensions of the framework. In particnlar, we demon¬ 
strate: 

• modihcations needed for the elastic net (Zou and Hastie, 2005). 

• different norms as test statistics for the “goodness of ht” test discnssed in Section 2.5. 

• estimation of ci^ based on htting the lasso with a sufficiently small A. 

• composite null hypotheses. 

• htting the lasso for a sequence of A values and its effect on onr basic tests and intervals. 


CHAPTER 2. SELECTIVE INFERENCE FOR THE LASSO 


2.2 Preliminaries 

Necessary and sufficient conditions for (/3, z) to be solutions to the lasso problem (2.1.2) are 
the Karush-Kuhn-Tucker (KKT) conditions: 


X^(X/3 -y) + Xz = G, 

(2.2.1) 

{ sign(/3i) if A / 0 

(2.2.2) 

Zi€ < 

[[-1,1] ifA = 0 



where z := d\ \ ■ ||i(/3) denotes the subgradient of the ii norm at /3. We consider the active 
set (Tibshirani, 2013) 

M = {i p} :\zi\ = l}, (2.2.3) 

so-named because by examining only the rows corresponding to M in (2.2.1), we obtain the 
relation 

Xl{y-XP) = -Xz^, 

where is the submatrix of X consisting of the columns in M. Hence 

\Xliy-XP)\ = X, 

i.e. the variables in this set have equal (absolute) correlation with the residual y — Xp. 
Since Zi G { — 1,1} for any /3j ^ 0, all variables with non-zero coefficients are contained in 
the active set. 

Recall that we are interested in inference for rj^y in the model (2.1.1) for some direction 
T] = rjj^ G IR”, which is allowed to depend on the selected variables M. In most applications, 
we will assume y = Xj3^, although our results hold even if the linear model is not correctly 
specified. 

A natural estimate for rj^y is rj^y. As mentioned previously, we allow y = to depend 
on the random selection procedure, so our goal is post-selection inference based on 

y'^y I (M = Mj. 


For reasons that will become clear, a more tractable quantity is the distribution conditional 
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on both the selected variables and their signs 

ify I {(M,Zj^) = {M,zm)}- 

Note that confidence intervals and hypothesis tests that are valid conditional on the finer 
partition = {M,zm)} will also be valid for {M = M}, by summing over the 

possible signs zm'- 

P( . I M = M) = ^P( • I {M,z^) = {M,zm)) nzM = ZM I M = M). 

Zm 

From this, it is clear that controlling P( • | {M,Zj^) = {M,zm)) to be, say, less than a (as 
in the case of hypothesis testing) will ensure P(-|M = M)<a. 

It may not be obvious yet why we condition on {(M, Zj^) = (M, zm)} instead of {M = 
M}. In the next section, we show that the former can be restated in terms of affine 
constraints on y, i.e., {Ay < b}. We revisit the problem of conditioning only on {M = M} 
in Section 2.7. 

2.3 Characterizing Selection for the Lasso 

Recall from the previous section that our goal is inference conditional on {{M,Zj^) = 
{M,zm)}- In this section, we show that this selection event can be rewritten in terms of 
affine constraints on y, i.e., 

{(M, Zj^) = (M, zm)} = {A{M, ZM)y < b{M, zm)} 

for a suitable matrix A{M, zm) and vector b{M, zm)- Therefore, the conditional distribution 
y I {(-^)^m) = (M, Zm)} is simply y | {A{M, ZM)y < b{M, zm)}- This key theorem follows 
from two intermediate results. 

Lemma 2.3.1. Without loss of generality, assume the columns of X are in general posi¬ 
tion. Let M C { 1 ,.. . ,p} and zm £ {~1) be a candidate set of variables and signs, 
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respectively. Define 

U = U{M, zm) := {XljXM)-\xljy - \zm) (2.3.1) 

W = W(M, zm) := Xfi^{Xl)+ZM + ^Xfi^il - PM)y. (2.3.2) 

Then the selection procedure can be rewritten in terms of U and W as: 

{{M,z^) = {M,zm)} = {sign{U {M, zm)) = zm, ||W^(M, zm)|Ioo < 1} (2.3.3) 

Proof. First, we rewrite the KKT conditions (2.2.1) and (2.2.2) by partitioning them ac¬ 
cording to the active set M: 

“ y) + = C) 

“ y) + ~ 

sign(/3xi) = e (-1, !)• 

Since the KKT conditions are necessary and sufficient for a solution, we obtain that 
{(M, Zj^) = (M, Zm)} if and only if there exist U and W satisfying: 


Xm{^mU — y) + ^zm — 0 

(2.3.4) 

X^^i^MU - y) + XW = 0 

(2.3.5) 

sign(?7) = ZM,W € (-1,1). 

(2.3.6) 


Solving (2.3.4) and (2.3.5) for U and W yields the formulas (2.3.1) and (2.3.2). Finally, the 
requirement that U and W satisfy (2.3.6) yields (2.3.3). □ 

Lemma 2.3.1 is remarkable because it says that the selection event {{M,Zj^) = (M, zm)} 
is equivalent to affine constraints on y. To see this, note that both Lf and W are affine func¬ 
tions of y, so {sign([/) = zm, llff^lloo < f} written as affine constraints {A{M, ZM)y < 

h{M,ZM)}- The following proposition provides explicit formulas for A and b. 

Proposition 2.3.2. Let U and W be defined as in (2.3.1) and (2.3.2). Then: 

{ /klo(M, zm)\ ^ /6o(M, ^m)\ 1 
\ (M, zm)}^ V^i(M, zm)}] 


{sign(t/) = zm, ||hF||^ < 1} 


(2.3.7) 
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where Aq^Iq encode the “inactive” constraints {||ll^||oo < ^.nd Ai,bi encode the “active” 
constraints {sign(t/) = zm}- These matrices have the explicit forms: 


Ao{M,zm) = j( bo{M,ZM) 

^ \-xf^{i - Pm)J 

Ai{M, zm) = - diag{zM)iNMXM)~^Xl[ bi{M, zm) 


fl-Xf^{Xl) + ZM\ 
\l+Xf^{Xl) + ZM) 
-\diag{zM){X'ljXM)~^ ZM 


Proof. First, we write 

{sign(C/) = zm} = {diag{zM)U > 0}. 

From here, it is straightforward to derive the above expressions from the definitions of U 
and W given in (2.3.1) and (2.3.2). □ 

Combining Lemma 2.3.1 with Proposition 2.3.2, we obtain the following. 

Theorem 2.3.3. The selection procedure can be rewritten in terms of affine constraints on 

y: 

{{M,Zj^) = {M,zm)} = {A{M,ZM)y < 6(M, zm)}- 

To summarize, we have shown that in order to understand the distribution of y ~ 
N{fj.,T,) conditional on the selection procedure {(M,z^) = {M,zm)}, it suffices to study 
the distribution of y conditional on being in the polytope {Ay < b}. The next section 
derives a pivot for rj'^jj, for such distributions, which will be useful for constructing confidence 
intervals and hypothesis tests in Section 2.5. 


2.4 A Pivot for Gaussian Vectors Subject to Affine Con¬ 
straints 

The distribution of a Gaussian vector y ~ N(p, S) conditional on affine constraints {Ay < 
b}, while explicit, still involves the intractable normalizing constant P{Ay < b). In this 
section, we show that one dimensional projections of p, (i.e., rj'^p) are univariate truncated 
normal, which will allow us to form tests and intervals for 
The key to deriving this pivot is the following lemma: 

Lemma 2.4.1. The conditioning set can be rewritten in terms ofrj^y as follows: 


{Ay <b} = {V (y) < ify < V+(y), V°(y) > 0} 
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where 


V 

v+ 

v° 


AEt] 

rf'Erj 


V-(y) 

V+(y) 

V\y) 


max • 

j\ aj<{) 


mm • 

j\ aj>0 


(Ay)j + ajrj'^y 

ttj 

{Ay)j + ajy'^y 


min bj — {Ay)j 

j: aj=0 


(2.4.1) 

(2.4.2) 

(2.4.3) 

(2.4.4) 


Furthermore, {V~^,V ,V^) is independent of if"y. Then, rf"y conditioned on Ay < b and 
0^~^iy)j'^~iy)) = (^'*')'f^~); ho.s a truncated normal distribution, i.e. 


'q^y\{Ay<b,V^{y) = v~^,V {y) = v } ^ TN{rj^ pL,rj^Eri,v ,v^). 


However, before stating the proof of this lemma, we show how it is used to obtain our 
main result. 

Theorem 2.4.2. Let denote the CDF of a N{y, cr^) random variable truncated to the 
interval [a,b], i.e.: 

(2.4.5) 


pM]. X ^ 4»((x-^)/(T)-^((a-^)/(j) 
< h ((6 — /i)/lT) — <h((a — ;U)/(T) 


where $ is the CDF of a A^(0,1) random variable. Then “Is a pivotal 

quantity, conditional on {Ay < b{: 


I {Ay < 6} ~ Unif(0,1) 


(2.4.6) 


where V andV~^ are defined in (2.4.2) and (2.4.3). 

Proof. By Lemma 2.4.1, p'^y | { Ay < b, V~^{y) = v~^,V~{y) = u“} ~ TN{p^p,, p'^Ep, v~,v~^). 
We apply the CDF transform to deduce 


F 


V 


(p'^y)\{Ay <b,V^{y) =v+,V {y) = v } 


is uniformly distributed. By integrating over (V''"(y) = v~^,V (y) = v ), we conclude 
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I ^ H ~ Unif(0, 1). Let G{v+, v ) = P(V+ <v+,V < y- 


P <s \ Ay <b) 


= I y)<s\Ay< 6, V+(y) = V (y) = 


dG{v~^,v ) 

= y s dG{v~^, v~) 


= s. 


We now prove Lemma 2.4.1. 

Proof. The linear constraints Ay < b are equivalent to 

Ay - E[Ay \ y'^y] <b- E[Ay \ y^y]. 


Since conditional expectation has the form 


E[Ay I y^y] = Ay + a{y'^y - y'^y), a = 


AEy 

y'^Ey' 


(2.4.7) simplifies to Ay — b — ay'^y < —ay'^y. Rearranging, we obtain 


1 


V y>—{bj- {Ay)j + ajy y) 


Oj < 0 


y^y < —{bj - {Ay)j + ajy^y) 

Qj 

0<bj - {Ay)j 


Oj > 0 


aj = 0 . 


We take the max of the lower bounds and min of the upper bounds to deduce 

1 1 
max —{bj - {Ay)j + ajy'^y) < y'^y < min —{bj - {Ay)j + Ojy^y) 

j:aj<^ aj j:aj>0 Qj 


V-{y) 


V+{y) 


Ay < b). 


□ 


(2.4.7) 


Since y is normal, bj — {Ay)j + Ojy'^y, j 


l,...,m are independent of y'^y. Hence 
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(V"^(?/), V~{y), V°(y)) are also independent of 77 ^ 7 /. 

To complete the proof, we must show rj'^y given Ay < b, {V~^{y),V~(y)) = {v^,v~) is 
truncated normal. 


P {y'^y < s \Ay <b,V^iy) = v^,V (y) = v ,) 

= P [ri'^y < s < rj'^y < v^,V~^{y) = v~^,V~{y) 
P [rj'^y < s,v~ < rf'y < v^\V~^{y) = v~^,V~{y) 
P {v~ < rj'^y < v+\V~^{y) = u+, V~{y) = f 


P {y'^y < s, u < y'^y < 
P {v~ < y'^y < 7;+) 


= P {y'^y < s 


= v-,V^{y)>0) 
= v-,V°{y)>0) 

-,V0(7/) >0) 

< y^y < 


where the second to last equality follows from the independence of (V'’“,V , V^) and y'^y. 
This is the CDF of a truncated normal. □ 


Although the proof of Lemma 2.4.1 is elementary, the geometric picture gives more 
intuition as to why V"'' and V~ are independent of y'^y. Without loss of generality, we 
assume ||r /||2 = 1 and y ~ N{y,I) (since otherwise we could replace y by T,~^y). Now we 
can decompose y into two independent components, a 1 -dimensional component y'^y and 
an [n — l)-dimensional component orthogonal to y: 


y = y'^y + Prj^y- 


The case of = 2 is illustrated in Figure 2.1. V~ and V'*' are independent of y^y, since 
they are functions of only, which is independent of y^y. 

In Figure 2.2, we plot the density of the truncated Gaussian, noting that its shape 
depends on the location of y relative to [a, b] as well as the width relative to a. 

2.4.1 Adaptive choice of y 

For the applications to forming confidence intervals and significance testing, we will need 
choices of y that are adaptive, or dependent on y. We will restrict ourselves to functions 
y that are functions of the partition, 77 ( 77 ) = f{M{y)). This choice of functions includes 
77 ( 7 /) = X'^P' Ej which is used for forming confidence intervals in Section 2.5. 
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Figure 2.1: A picture demonstrating that the set {Ay < b} can be characterized by {V < 
rj'^y < V"^}. Assuming S = / and \ \r ]\\2 = 1, V~ and V"*" are functions of P^±y only, which 
is independent of rfy. 


Theorem 2.4.3. Let rj : R" —)> R"' be a function of the form rj{y) = f{M{y)), then 


NV-(y),V+(y)] 

vivTu, vivV^viy) 


{viy)'^y) ~ Unif(0 ,l). 


Proof. We can expand F with respect to the partition, 


(M,s) 

V f (dto)“:'',w4,(,) < AVv) = m) p {m(v) = M 

{M,s) 

E f WNy'l s *!*{!') = m) p 

{M,s) 


fV-{y),V+{y)] 


= M 
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Figure 2.2: The density of the truncated Gaussian with distribution ^ depends on 

the width of relative to a as well as the location of [i relative to When 

/i is firmly inside the interval, the distribution resembles a Gaussian. As /U varies outside 
the density begins to converge to an exponential distribution with mean inversely 
proportional to the distance between ^ and its projection onto 

Using Theorem 2.4.2, P ^ t\M{y) = M^ =t. Thus 

< *) = E ‘P = «) 

(M,s) 

= t ^ P (^M{y) = m) 

{M,s) 

= t. 


This shows that iviyfy) ~ Unif(0,1). □ 

2.5 Application to Inference for the Lasso 

In this section, we apply the theory developed in in Sections 2.3 and 2.4 to the lasso. In 
particular, we will construct confidence intervals for the active variables and test the chosen 
model based on the pivot developed in Section 2.4. 
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Figure 2.3: Histogram and empirical distribution of obtained by sampling 

Tj TJ 

y ~ N{y, S) constrained to {Ay < b}. The distribution is very close to Unif(0,1) as shown 
in Theorem 2.4.2. 


To summarize the developments so far, recall that our model says that y N{fj.,a^I). 
The distribution of interest is y \ {(M, = (M, zm){- By Theorem 2.3.1, this is equivalent 

to y I {A{M^zm)v < &(Tf, zm)} defined in Proposition 2.3.2. Now we can apply Theorem 
2.4.2 to obtain the (conditional) pivot 

I = {M,zm)} ~ Unif(0,l) (2.5.1) 

for any y, where V~ and V~^ are defined in (2.4.2) and (2.4.3). Note that A{M,zm) and 
6(M, zm) appear in this pivot through V~ and V"*". This pivot will play a central role in all 
of the applications that follow. 

2.5.1 Confidence Intervals for the Active Variables 

In this section, we describe how to form confidence intervals for the components of I3^ = 
u. If we choose 

( 2 - 5 - 2 ) 

then yjy = jj so the above framework provides a method for inference about the 
^•th yariable in the model M. Note that this reduces to inference about the true if 
M D S := {j : / 0}, as discussed in Section 2.1. Conditions under which this holds 

are well known in the literature, cf. Biihlmann and van de Geer (2011), and provided in 
Section 2.11. 
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By applying Theorem 2.4.2, we obtain the following (conditional) pivot for 

I ~unif(0,l). 

Note that j and r]j are both random—but only through M, a quantity which is fixed after 
conditioning—so Theorem 2.4.2 holds even for this “random” choice of rj. The obvious way 
to obtain an interval is to “invert” the pivot. In other words, since 


P ( I < ^ “ f I ) = a, 


one can define a (1 — a) (conditional) confidence interval for ^ as 


^ M,j 


9 — p* , 


Wvj 


ivjy) < 1 



In fact, F is monotone decreasing in 0^ j, so to find its endpoints, one need only solve for 
the root of a smooth one-dimensional function. The monotonicity is a consequence of the 
fact that the truncated Gaussian distribution is a natural exponential family and hence has 
monotone likelihood ratio in /i. The details can be found in Appendix 2.10.1. 

We now formalize the above observations in the following result, an immediate conse¬ 
quence of Theorem 2.4.2. 

Corollary 2.5.1. Let rjj be defined as in (2.5.2), and let Li, = Li{r]j, M, and Ui = 
Ui{r]j, M, Zj^) be the (unique) values satisfying 


p[v-,v+] 

Li, 


[rijV) = 1 - a 


^[V-,V+] 

uL 



= a 


Then [Li,,Lffi\ is a {1 — a) confidence interval for rjJ g,, conditional on {M,Zj^): 


p e [Li, Ui] I {(M, z^) = (M, zm)}) =1-0. (2.5.3) 


The above discussion has focused on constructing intervals for a single j. If we repeat 
the procedure for each j G M, our intervals in fact control the false coverage rate (FCR) of 
Benjamini and Yekutieli (2005). 
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Corollary 2.5.2. For each j G M, 


P 




1 — a. 


(2.5.4) 


Furthermore, the FCR of the intervals 



is a. 


If are not near the boundaries [V“, V'*'], then the intervals will be relatively short. 
This is shown in Figure 2.4. Figure 2.5 shows two simulations that demonstrate our intervals 
cover at the nominal rate. We leave an exhaustive study of such intervals for the lasso to 
future work, noting that the truncation framework described can be used to form intervals 
with exact coverage properties. 




Figure 2.4: Upper and lower bounds of 90% confidence intervals based on [a, b] = [—3cj, 3cj] 
as a function of the observation x/a. We see that as long as the observation xja is roughly 
0.5(7 away from either boundary, the size of the intervals is comparable to an unadjusted 
confidence interval. 


2.5.2 Testing the Lasso-Selected Model 

Having observed that the lasso selected the variables M, another relevant question is 
whether it has captured all of the signal in the model, i.e.. 


^0 : = 0 . 


( 2 . 5 . 5 ) 
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Figure 2.5: 90% confidence intervals for for a small (n = 100, p = 50) and a large 
(n = 100, p = 200) uncorrelated Gaussian design, computed over 25 simulated data sets. 
The true model has five non-zero coefficients, all set to 5.0, and the noise variance is 0.25. 
A green bar means the confidence interval covers the true value while a red bar means 
otherwise. 

We consider a slightly more general question, which does not assume the correctness of the 
linear model p = and also takes into account whether the non-selected variables can 
improve the fit: 

Ho-.Xl^^{I-Pj^)p = 0. (2.5.6) 

This quantity is the partial correlation of the non-selected variables with p, adjusting for 
the variables in M. This is more general because if we assume p = X/3^ for some and 
X is full rank, then rejecting (2.5.6) implies that there exists i G supp(/3*^) not in M, so we 
would also reject (2.5.5). 

The natural approach is to compare the observed partial correlations X'^j^{I — PM)y to 
0. However, the framework of Section 2.4 only allows tests of /r in a single direction rj. To 
make use of that framework, we can choose p such that it selects the maximum magnitude 
of X'^jy^{I — PM)y- In particular, this direction provides the most evidence against the null 
hypothesis of zero partial correlation, so if the null hypothesis cannot be rejected in this 
direction, it would not be rejected in any direction. 
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Letting f := argmax^ - PM)y\ and Sj := sign{ejX'^j^{I - PM)y), we set 

Vj* = - PM)X-Mej*, (2.5.7) 

and test Hq : = 0. However, the results in Section 2.4 cannot be directly applied to 

this setting because j* and sj* are random variables that are not measurable with respect 
to {M,Zj^). 

To resolve this issue, we propose a test conditional not only on (M, Zj^), but also on the 
index and sign of the maximizer: 


{{M, Zj^) = (M, zm), if, Sj*) = (j, s)}. (2.5.8) 


A test that is level a conditional on (2.5.8) for all (M, zm) and (j, s) is also level a conditional 
on {(M,Zj^) = (M,zm)}- 

In order to use the results of Section 2.4, we must show that (2.5.8) can be written in the 
form A(M, ZM,j, s)y < b{M, ZM,j, s). This is indeed possible, and the following proposition 
provides an explicit construction. 

Proposition 2.5.3. Let Ao,bQ, Ai,bi be defined as in Proposition 2.3.2. Then: 



✓ 

^ AfM, zm)'^ 


^boiM, zm)'^ 


= iM,ZM), if,Sj*) = (j,s)} = < 


Ai{M, zm) 

y < 

bi{M, zm) 




\A2iM,j,s)j 


1 0 J 

> 


where A 2 iM,j,s) is defined as 


M{M,j,s) 


(DfM)\ 

\SfM)) 




and Dj and Sj are (|M| — 1) x |M| operators that compute the difference and sum, respec¬ 
tively, of the element with the other elements, e.g.. 


Di 


(l -1 
1 -1 


\ 




/l 1 
1 1 


\ 


VI 


V 


VI 


V 
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Proof. The constraints {Aqu < bo} and {Aiy < 61 } come from Proposition (2.3.2) and 
encode the constraints {(M, Zj^) = (M, zm)}- We show that the last two sets of constraints 
encode = (j,s)}. 

Let r := Xfj^{I — PM)y denote the vector of partial correlations. If s = +1, then 
\rj\ > \ri\ for all i j if and only if rj — > 0 and rj + r* > 0 for all i j. We can write 
this as DjV > 0 and SjV >0. If s = —1, then the signs are flipped: Djr < 0 and Sjr < 0. 
This establishes 

r < o| = {A 2 y < 0}. 

□ 


{{j\si.) = {j,s)} = {-s 


D, 

.Si 


Because of Proposition 2.5.3, we can now obtain the following result as a simple con¬ 
sequence of Theorem 2.4.2, which says that -^d^ 2 ||^ J|| 2 (^J*y) ~ Unif(0,l), conditional on 
the set (2.5.8) and Hq. We reject when is large because (•) is 

monotone increasing in the argument and yj, // is likely to be positive under the alternative. 


Corollary 2.5.4. Let Hq and rjj* be defined as in (2.5.7). 
when 


{^(l!'X]||2(^J*y) > l-a} 


Then, the test which rejects 


is level a, conditional on {(M, Zj^) = (M, zm)-, {j*, Sj*) = {j, s)}. That is, 


^ [^ 0 ! > 1 - « I = {M,Zm): = (j, s)} H i^oj = «• 


In particular, since this holds for every {M, zm, test also controls Type I error 

conditional only on {M,Zj.^), and unconditionally: 


»o) = “' 

Figures 2.6 and 2.7 show the results of four simulation studies that demonstrate that 
the p-values are uniformly distributed when 77 o,a is true and stochastically smaller than 
Unif(0,1) when it is false. 
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Figure 2.6: P-values for Hq x at various A values for a small (n = 100, p = 50) and a large 
(n = 100, p = 200) uncorrelated Gaussian design, computed over 50 simulated data sets. 
The true model has three non-zero coefficients, all set to 1.0, and the noise variance is 
2.0. We see the p-values are Unif(0,1) when the selected model includes the truly relevant 
predictors (black dots) and are stochastically smaller than Unif(0,1) when the selected 
model omits a relevant predictor (red dots). 


2.6 Data Example 

We illustrate the application of inference for the lasso to the diabetes data set from Efron 
et al. (2004). First, all variables were standardized. Then, we chose A according to the 
strategy in Negahban et al. (2012), A = 2 E(||X^e||oo), using an estimate of a from the full 
model, resulting in A ~ 190. The lasso selected four variables: BMI, BP, S3, and S5. 

The intervals are shown in Figure 2.8, alongside the unadjusted confidence intervals 
produced by fitting OLS to the four selected variables, ignoring the selection. The latter is 
not a valid confidence interval conditional on the model. Also depicted are the confidence 
intervals obtained by data splitting; that is, if one splits the n observations into two halves, 
then uses one half for model selection and the other for inference. This is a competitor 
method that also produces valid confidence intervals conditional on the model. In this case, 
data splitting selected the same four variables, and the confidence intervals were formed 
based on OLS on the half of the data set not used for model selection. 

We can make two main observations from Figure 2.8. 
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Figure 2.7: P-values for Hq \ at various A values for a small (n = 100, p = 50) and a large 
(n = 100, p = 200) correlated {p = 0.7) Gaussian design, computed over 50 simulated data 
sets. The true model has three non-zero coefficients, all set to 1.0, and the noise variance is 
2.0. Since the predictors are correlated, the relevant predictors are not always selected first. 
However, the p-values remain uniformly distributed when Hq \ is true and stochastically 
smaller than Unif(0,1) otherwise. 


1. The adjusted intervals provided by our method essentially reproduces the OLS inter¬ 
vals for the strong effects, whereas data splitting results in a loss of power by roughly 
a factor of \/2 (since only n/2 observations are used in the inference). 

2. One variable, S3, which would have been deemed significant using the OLS intervals, is 
no longer significant after adjustment. This demonstrates that taking model selection 
into account can have substantive impacts on the conclusions that are made. 


2.7 Minimal Post-Selection Inference 

We have described how to perform post-selection inference for the lasso conditional on both 
the active set and signs {{M,Zj^) = {M,zm)}- However, recall from Section 2.1 that the 
goal was inference conditional solely on the model, i.e., {M = M}. In this section, we extend 
our framework to this setting, which we call minimal post-selection inference because we 
condition on the minimal set necessary for the random rj to be measurable. This results in 
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Figure 2.8: Inference for the four variables selected by the lasso (A = 190) on the diabetes 
data set. The point estimate and adjusted conhdence intervals using the approach in Section 
2.5 are shown in blue. The gray show the OLS intervals, which ignore selection. The yellow 
lines show the intervals produced by splitting the data into two halves, forming the interval 
based on only half of the data. 


more precise confidence intervals at the expense of greater computational cost. 
To this end, we note that {M = M} is simply 


u {{M,z^) = iM,ZM)}, 

2mS{—1,1}TI 

where the union is taken over all choices of signs. Therefore, the distribution of y conditioned 
on only the active set {M = M} is a Gaussian vector constrained to a union of polytopes 

y\ U {A{M,ZM)y <b{M,ZM)}, 

where A{M,zm) and b{M,ZM) are given by (2.3.2). 

To obtain inference about rf"y, we follow the arguments in Section 2.4 to obtain that 
this conditional distribution is equivalent to 


v'^y I U {Km( y) ^ ^ Km( y)> (y) ^ 


(2.7.1) 
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where V^, are defined according to (2.4.2), (2.4.3), (2.11.4) with A = A{M,zm) 

and b = b{M, zm)- Moreover, all of these quantities are still independent of rj^y, so instead 
of having a Gaussian truncated to a single interval [V“, V'*'] as in Section 2.4, we now have 
a Gaussian truncated to the union of intervals geometric intuition is 

illustrated in Figure 2.9. 



Figure 2.9: A picture demonstrating the effect of taking a union over signs. The polytope 
in the middle corresponds to the {M,Zj^) that was observed and is the same polytope as 
in Figure 2.1. The difference is that we now consider potential {M,zm) in addition to the 
one that was observed. The polytopes for the other {M,zm) which have the same active 
set M are red. The conditioning set is the union of these polytopes. We see that for y to 
be in this union, rf"y must be in The key point is that all of the and 

are still functions of only and so are independent of rf"y. 


Finally, the probability integral transform once again yields a pivot: 


y) I {M = M} ~ Unif(0,1). 

It is now more useful to think of the notation of F as indicating the truncation set C C IR: 

<!>((—oo, x] n C) 




HC) 


(2.7.2) 


where is the law of a A^(0,1) random variable. We summarize these results in the following 
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Figure 2.10: Comparison of the minimal and simple intervals as applied to the same simu¬ 
lated data set for two values of A. The simulated data featured n = 25, p = 50, and 5 true 
non-zero coefficients; only the first 20 coefficients are shown. (We have included variables 
with no intervals to emphasize that inference is only on the selected variables.) We see that 
the simple intervals are virtually as good as the minimal intervals most of the time; the 
advantage of the minimal intervals is realized when the estimate is unstable and the simple 
intervals are very long, as in the right plot. 


theorem. 

Theorem 2.7.1. Let CDF of a normal truncated to the union of intervals 

UJojj&i], i-e., given by (2.7.2). Then: 

I {M = M} ~ Unif(0,1), (2.7.3) 

where V^(y) and V^^(y) are defined in (2.4.2) and (2.4.3) with A = A{M,zm) and b = 
b{M, zm)- 

The derivations of the confidence intervals and hypothesis tests in Section 2.5 remain 
valid using (2.7.3) as the pivot instead of (2.5.1). Figure 2.10 illustrates the effect of minimal 
post-selection inference in a simulation study, as compared with the “simple” inference 
described previously. The intervals are similar in most cases, but one can obtain great 
gains in precision using the minimal intervals when the simple intervals are very wide. 

However, the tradeoff for this increased precision is greater computational cost. We 
computed and for all zm £ {~1) which is only feasible when \M\ is fairly 

small. In what follows, we revert to the simple intervals described in Section 2.5, but 
extensions to the minimal inference setting are straightforward. 
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2.8 Extensions 

2.8.1 Elastic net 

One problem with the lasso is that it tends to select only one variable out of a set of 
correlated variables, resulting in estimates which are unstable. The elastic net (Zou and 
Hastie, 2005) adds an £2 penalty to the lasso objective in order to stabilize the estimates: 

/3^ = argmin J \\y - X/3\\l + A ||^||i + ^ \\l3\\l. (2.8.1) 

y 2 2 

Using a nearly identical argument to the one in Section 2.3, we see that necessary and 
sufficient conditions for {(M, Z£j) = (M, zm)} are the existence of f7(M, zm) and W (M, zm) 
satisfying 


{N]^Xm + 'yI)U — X]^y + Xzm — 0 
xImXmU - X^mV + ait = 0 

sign(U) = ZM, W € (- 1 , 1 ). 


Solving for U and W, we see that the selection event can be written 


{(^>%) = {M,zm)} 


fAo{M,ZM)\ ^ Ao(AT,zm )\1 

\Ai{M,zm)} \bi{M, Zm)/ J 


( 2 . 8 . 2 ) 


where Aq, Ai, bo, and 61 are the same as in Proposition 2.3.2, except replacing (XJ^Xm) 
which appears in the expressions through Pm and {X'£j)~^, by the “damped” version {X'^Xm+ 

Having rewritten the selection event in the form (2.8.2), we can once again apply the 
framework of Section 2.4 to obtain a test for the elastic net conditional on this event. 


2.8.2 Alternative norms as test statistics 

In Section 2.5.2 we used the test statistic 


T — 

no — 
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and its conditional distribntion on {{M,Zj^) = {M,zm)} to test whether we had missed 
any large partial correlations in using M as the estimated active set. If we have indeed 
missed some variables in M there is no reason to snppose that the mean of X'^^{I — Pm)!! 
is sparse; hence the ioo norm may not be the best norm to use as a test statistic. 

In principle, we conld have used virtually any norm, as long as we can say something 
about the distribution of this norm conditional on {(M, = (M, zm)}- Problems of this 

form are considered in Taylor et al. (2013). For example, if we consider the quadratic 

T 2 = - PM)y \\2 

the general approach in Taylor et al. (2013) derives the conditional distribution of T 2 con¬ 
ditioned on 

r/2 = aTgmaxr]'^{X^i^{I - PM)y)- 
M2<1 

In general, this distribntion will be a snbject to random truncation as in Section 2.4 
(see the group lasso examples in Taylor et al. (2013)). Adding the constraints encoded by 
{(M, Zj^) = {VI, Zm)} affects only the random trnncation [V ,V''“]. 

2.8.3 Estimation of cr^ 

As noted above, all of our results rely on a reliable estimate of cr^. While there are several 
approaches to estimating cr^ in the literature, the truncated Gaussian theory described in 
this work itself provides a natural estimate. 

Suppose the linear model is correct (/r = Xj5^). Then, on the event {M = M, E 2) S}, 
which we assume, the residual 

{I - PM)y 

is a (multivariate) truncated Ganssian with mean 0, with law 

Fc,a^{B) = '^{Z ^B\Z eC), Z~iV(0,(j2/). 

As (T^ , one obtains a one-parameter exponential family with density 

dz 
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and natural parameter a = cr^/2. On the event {(M, -2^) = {M, zm)}, we set 

C = {y: A{M, ZM)y < b{M, zm)} , 

and then choose a (or equivalently, cr^) to satisfy the score equation 

^cMWZWl) = m - PM)y\\l (2.8.3) 

This amounts to a maximum likelihood estimate of The expectation on the left is 
generally impossible to do analytically, but there exist fast algorithms for sampling from 
V(j^ 2 , c.f. Geweke (1991); Rodriguez-Yam et al. (2004). A rough outline of a naive version 
of such algorithms is to pick a direction such as e* one of the coordinate axes. Based on the 
current state of Z, draw a new entry for the Zi from the appropriate nnivariate truncated 
normal determined from the cutoffs described in Section 2.4. We repeat this procedure to 
evaluate the expectation on the left, and use gradient descent to find 

2.8.4 Composite Null Hypotheses 

In Section 2.5, we considered hypotheses of the form Hq : = 0, which said that the 

partial correlation of the variables in — M with y, adjnsting for the variables in M, was 
exactly 0. This may be unrealistic, and in practice, we may want to allow some tolerance 
for the partial correlation. 

We consider testing instead the composite hypothesis 

Ho : < 6o. (2.8.4) 


The following result characterizes a test for Hq- 

Proposition 2.8.1. The test which rejects when > I — o is exact level 

a. 
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Proof. Let <5 := Define Ts^ := Then: 

Type I error := sup P5(T5o > I — oi) 

\5\<6o 

= a 

Next, we have that i-e-, the infimum is achieved at <5 = (5o, so 

calculating Tg^^ is a simple matter of evaluating T^p. This follows from the fact that Fs is 
monotone decreasing in 5 (c.f. Appendix 2.10.1). 

Finally, the Type I error is exactly a because the reverse inequality also holds: 

Type I error > P 5 p(T 5 p > 1 — a) = a. 


□ 

Although the test is exact level a, the significance level of a test for a composite null 
is a “worst-case” Type I error; for most values of ^ such that < 5o) the Type I error 

will be less than a, so the test will be conservative. Of course, what we lose in power, we 
gain in robustness to the assumption that rj^^ = 0 exactly. 

2.8.5 How long a lasso should you use? 

Procedures for fitting the lasso, such as glmnet (Friedman et ah, 2010b), solve (2.1.2) for a 
decreasing sequence of A values starting from Ai = ||A^y||oo- The framework developed so 
far provides a means to decide when to stop along the regularization path, i.e., when the 
lasso has done enough “fitting.” In this section, we describe a path-wise testing procedure 
for the lasso. 

The path-wise procedure is simple. At each value of A: 

1. Solve the lasso and obtain an active set Mx and signs Zj^^. 

2. Test Hqx '■ {I — Pp ){l) = 0 at level a. Rather than being conditional on only 

^ ’ Ex 

{Mx,Zj^J, this test is conditional on the entire sequence of active sets and signs 
{(M™,i"^) = (M™,z™')}, as we describe below. 
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As A decreases, we expect to reject the null hypotheses as the fit improves and stop once 
the first null hypothesis has been accepted. 

To understand the properties of this procedure, we formalize it as a multiple testing 
problem. For each value Ai,...,Am, we test Hq\.. We test these hypotheses sequentially 
and stop after the first hypothesis has been accepted. Implicitly, this means that we accept 
all the remaining hypotheses. 

Our next result shows that this procedure controls the family-wise error rate (FWER) at 
level a. Let V denote that number of false rejections. Then FWER is defined as P(R > 1). 
The practical implication of this result is the model selected by this procedure will be larger 
than the true model with probability a. 

Proposition 2.8.2. The path-wise testing proeedure eontrols FWER at level a. 

Proof. Let and denote the complete sequence of active sets and signs at Ai,..., Am, 
i.e.. 


We seek to control the family-wise error rate (EWER) when testing the hypotheses 
Hqx.^,...,Hq\^, i.e., ¥{V > 1). We partition the space over all possible sequences M™' 
and z'^: 

P(R>1)= P > 1 I P . 

Since z^) = (TL™', = 1, we can ensure FWER < a by ensuring 

P (r > 1 I < a for any {M^,z^). 

Let Afc denote the first A* for which Hq \. is true. Then the event R > 1 is equivalent to 
the event that we reject Hq^\^ because the preceding hypotheses LIo,Ai, • • •, ^^o,Afe_i are all 
false so we cannot make a false discovery before the hypothesis. Thus 

P (r > 1 I {M^,z^) = {M^,z^)'^ = P (reject Ho,x, \ {M^,z"^) = . 
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Therefore, we can control FWER at level a by ensuring 

P (reject Ho,x \ < a 

for each A G {Ai,..., A*,}. □ 

To perform a test of Hq \ conditioned on {(M™, I™) = (M™, z'^)}, we apply the frame¬ 
work of Section 2.4. Let 

{A{Mi,Si)y < b{Mi,Si)} 

be the affine constraints that characterize the event {(M;,., z\^) = (Mj, Zi)} from Proposition 
2.3.2. The event is equivalent to the intersection of all of these 

constraints: 


' A{AR,zi)' 


b{Mi,zi) 


y < 


A(^Mm, Zm) 


Zftl) 




Now Theorem 2.4.2 applies, and we can obtain the usual pivot as a test statistic. 

2.9 Conclusion 

We have described a method for making inference about y in the linear model based 
on the lasso estimator, where rj is chosen adaptively after model selection. The confidence 
intervals and tests that we propose are conditional on {(M,z^) = {M,zm)}- In contrast 
to existing procedures on inference for the lasso, we provide a pivot whose conditional 
distribution can be characterized exactly (non-asymptotically). This pivot can be used to 
derive confidence intervals and hypothesis tests based on lasso estimates anywhere along 
the solution path, not necessarily just at the knots of the LARS path as in Lockhart et ah 
(2014). Finally, our test is computationally simple: the quantities required to form the test 
statistic are readily available from the solution of the lasso. 
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2.10 Appendix 

2.10.1 Monotonicity of F 

Lemma 2.10.1. Let F^{x) := denote the cumulative distribution function of a 

truncated Gaussian random variable, as defined as in (2.4.5). Then Ffj,{x) is monotone 
decreasing in 

Proof. First, the truncated Gaussian distribution with CDF := is a natural expo¬ 
nential family in /r, since it is just a Gaussian with a different base measure. Therefore, it 
has monotone likelihood ratio in fa. That is, for all /ii > /tq and xi > xq: 

f^loixl) fuoixo) 

where := dF^. denotes the density. (Instead of appealing to properties of exponential 
families, this property can also be directly verified.) 

This implies 


fiMi{xi)ff,fixo) > ff,fixo)fi,fixi) 


Xi > Xq. 


Therefore, the inequality is preserved if we integrate both sides with respect to xq on 
(—oo,x) for X < xi- This yields: 


/X rx 

f^llixl)f^loixo)dxo > / f^ifixo)f^,oixi)dxo 

■OO J —OO 

fv-i ixi)F^g (x) > /mo (xi)F^i(x) 


X < Xl 
X < Xl 


Now we integrate both sides with respect to xi on (x, oo) to obtain: 


(1 - F^^(x))T^o(x) > (1 - F^fix))F^,fix) 


which establishes F^fix) > F^j^fix) for all /ii > ^o- 


□ 


2.11 Lasso Screening Property 

In this section, we state some sufficient conditions that guarantee support(/3‘^) C support(/3). 
Let M = support(/3^) and M C support(/3). The results of this section are well known in 
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the literature and can be found in (Blihlmann and van de Geer, 2011, Chapter 2.5). 


Definition 2.11.1 (Restricted Eigenvalue Condition). Restricted eigenvalue condition re¬ 
quires that X satisfy 

\\Xv\\l > m||n ||2 

for all V e {x : ||x_m|Ii < 3 ||xm||}- 


Definition 2.11.2 (Beta-min Condition). The beta-min condition requires that for all j G 
M, 


Theorem 2.11.3. Let y = XjS^ + e, where e is subgaussian with parameter a, and ft be 
the solution to 2.1.2 with A = Assume that X satisfies the restricted eigenvalue 

condition, (5^ satisfies the beta-min condition with (Imin = j ^ column 

normalized, ||xj ||2 < y/n. Then M G M. 


Proof. From (Negahban et ah, 2012, Corollary 2), 


1/3 


mV n 


Assume that their is a j such that j E M, but j ^ M. We must have 


P ~ R II2 > ~ 


8(7 s log p 


mV n 


This is a contradiction, so for all j £ M we have j G M. 


□ 


Next we provide a geometric proof of Lemma 2.4.1 which will be useful in the next 
chapter. 

Lemma 2.11.4. The conditioning set can be rewritten in terms ofrj^y as follows: 


{Ay <b} = {V (y) < rfy < V+(y), V°(y) > 0} 
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where 


V 

v+ 


AEr] 

^ ri'^Erj 

V~{y) = max 

j: aj<0 

V^{y) = min — 

j: aj>0 

V^{y) = min bj 

j: aj=0 


{Ay)j + ajTj'^y 

aj 

{Ay)j Vajify 

aj 

i^y)j 


( 2 . 11 . 1 ) 

( 2 . 11 . 2 ) 

(2.11.3) 

(2.11.4) 


Moreover, (V^,V ,V°) are independent of y. 

Proof. Although the proof of Lemma 2.11.4 is elementary, the geometric picture gives more 
intuition as to why V*" and V~ are independent of rf^y. Since S is assumed known, let y = 
E~ 2 y so that y ~ N{E~^y, I). We can decompose y into two independent components: a 
one-dimensional component along fj := and a (p—l)-dimensional component orthogonal 
to t): 

y = yfi + yrj^- 

From Figure 2.1, it is clear that the extent of the set {Ay < b} = {AE'Sy < b} (i.e., V'*' 
and V~) along the direction r) depends only on yfj± and is hence independent of = rj^y. 
We present a geometric derivation below. The values V'*' and V~ are the maximum and 
minimum possible values of rf^y, holding yf^± hxed, while remaining inside the polytope 
AEiy < b. Writing y = cfj + yfj± where c is allowed to vary, and V~ are the optimal 
values of the optimization problems: 


max. / min. ffy = c\\fi \\2 

subject to AS 2 {cfj -|- y^^i) < b 


Rewriting this problem in terms of the original variables rj and y, we obtain: 


max. / min. 
subject to 


c(ry^Sy) 

c(ASry) < 6 - Ay 


Since c is the only free variable, we see from the constraints that the optimal values and 
V~ are precisely those given in (2.4.2) and (2.4.3). □ 






Chapter 3 

Condition-on-Selection Method 


In the previous chapter, we focused on selective inference for the sub-model coefficients 
selected by the lasso by conditioning on the event that lasso selects a certain subset of 
variables. However the procedure we developed is not restricted to the sub-model coef- 
hcients, nor is it restricted to the lasso. In Lee and Taylor (2014), we used the same 
Condition-on-Selection (COS) method for marginal screening, orthogonal matching pur¬ 
suit, and screening-l-lasso variable selection methods. 

In this chapter, we first discuss some definitions and formalism, which will help us 
understand how to generalize the results of Chapter 2 to other selection procedures. In 
Section 3.1, we see that the COS method results in tests that control the selective type 1 
error. Then in Section 3.2, we show how the selection events for several variable selection 
methods such as marginal screening, and orthogonal matching pursuit are affine in the 
response y. For non-affine selection events, we propose a general algorithm in Section 3.3. 
We then describe inference for the full model regression coefficients, provide a method for 
FDR control and establish the asymptotic coverage property in the high-dimensional setting 
in Section 3.4. Finally in Section 3.5, we show how to construct selectively valid confidence 
intervals for regression coefficients selected by the knockoff filter (Foygel Barber and Candes, 
2014). 

3.1 Formalism 

This section closely follows the development in Fithian et al. (2014), which in turn uses the 
COS method developed in earlier works Lee and Taylor (2014); Lee et al. (2013a); Taylor 
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et al. (2014). Our main result of this section is to show that tests constructed using the 
COS method control selective type 1 error. This is the original motivation of Lee and Taylor 
(2014); Lee et al. (2013a) for designing tests with the COS method. 

We start off by defining a valid test in the classical setting. 

Definition 3.1.1 (Valid test). Let H € H be a hypothesis, and (p{y,H) G {0,1} is a test 
of H meaning we reject H if (p{y; H) = 1. H) is a valid test of H if 

Pf i4>{y,H) = 1) <a 

for all F null with respect to H, meaning F G Nh , where Nh is the set of distributions 
null with respect to H. 

For selective inference, there is an analog of type 1 error. 

Definition 3.1.2 (Selective Type 1 Error ). (p{y,H{y)) is a valid test of the hypothesis 
H{y) if it controls the selective type 1 error, 

Pf {(j){y,H{y)) = 1 | F G NH{y)) < a. 


The framework laid out in Chapter 2 proposes controlling the selective type 1 error 
via the COS method. As we showed in the case of confidence intervals for regression 
coefficients and goodness-of-fit tests, by conditioning on the lasso selection event, we are 
guaranteed to control the conditional type 1 error by design, and this implies the control of 
the unconditional type 1 error. We now show that this is not specific to the lasso; in fact 
controlling the conditional type 1 error always controls the unconditional type 1 error in 
Dehnition 3.1.2. 

Definition 3.1.3. Let TL be the hypothesis space. The selection algorithm H : —)■ A 

maps data to hypothesis. This induces the selection event S{H) = {y : H{y) = H}. 

The following definition motivates the construction in Equation (2.5.3). 

Definition 3.1.4 (Condition-on-Selection method). A testcf is constructed via the Condition- 
on-Selection (COS) method if for all F G Nh^ 


Pf {(j{yH,) = l\H{y) = Hi). 


(3.1.1) 
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This means that Hi) controls the conditional type 1 error rate. 

By a simple generalization of the argument in Theorem 2.4.3, we show that using the 
COS method to design a conditional test 3.1.1 implies control of the selective type 1 error 

3.1.2. 

Theorem 3.1.5 (Selective Type 1 Error control). A test constructed using the COS method, 
i.e. satisfies (3.1.1), controls the selective type 1 error meaning 

Pf {cj){y;H{y)) = l\ F £ iV//(j;)) < a. 


Proof. 

\n\ 

PFifiiy; H{y)) = 1\F£ NHiy)) = Pifiiy, H{y)) = 1, H{y) = Hi) \ F £ iV^(,)) 

i=l 

= Y P{<P{y,H{y)) = l,H{y) = Hi\F£NHiy)) + 

i:FeN„. 

Y H{y)) = 1, H{y) = Hi \ F £ Nj^^y)) 

= Y Pi(p{y;H{y)) = l,H{y) = Hi\F£NHiy))+0 

^■■FeNH^ 

= Y Pi<P{y,H{y)) = l\F£NHiy),Hiy) = Hi)F{H{y) = H,\F£NHiy)) 

i■.F&NH^ 

= Y ^{4>{yH^) = l\H{y) = H,)P{H{y)=Hi\F£NH^y)) 

i-.F&Nn^ 

= Y c,^{H{y)=Hi\F£NHi^y)) 

^:FeNH^ 

< a. 


where all of the previous probabilities are with respect to the distribution F. The first equal¬ 
ity is the law of total probability, and the second equality is breaking the sum over disjoint 
sets. Since F 0 Nui, implies PF{H{y) = Hi \ F £ Nn^y)) = 0, so Yjv.f^h, ^{(l>{y, H {y)) = 
l,H{y) = Hi \ F £ NF(y)) = 0, which establishes the third equality. The fourth equal¬ 
ity is the definition of conditional probability, and the fifth follows from noticing that 
{F £ NF(y), H{y) = Hi,F £ Nffi} = {H{y) = Hi,F £ Nh^}. The sixth equality uses the 
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COS property of (j): PF{4>{y] Hi) = l \ H{y) = Hi) < a for any F G Nh-. Finally, the result 
follows since probabilities sum to less than or equal to 1. □ 

This result allows us to interpret the tests constructed via the COS method as uncon¬ 
ditionally valid. 

3.2 Marginal Screening, Orthogonal Matching Pursuit, and 
other Variable Selection methods 

In lieu of the developments of the previous section, it is clear that the COS method de¬ 
veloped for affine selection events in Chapter 2 is not specific to the lasso. By changing 
the variable selection method, we are simply changing the selection algorithm and the se¬ 
lection event. The main work is in characterizing the selection event {y : M{y) = M}, 
the event that the variable selection methods chooses the subset M. In this section, we 
characterize the selection event for several variable selection methods: marginal screening, 
orthogonal matching pursuit (forward stepwise), non-negative least squares, and marginal 
screening-|-lasso. 

3.2.1 Marginal Screening 

In the case of marginal screening, the selection event M (y) corresponds to the set of selected 
variables M and signs s: 

M{y) = |y : sign{xf y)xjy > ±xjy for all i G M and j G 
= |y : Sixfy > Txjy and Sixjy > 0 for alH G M and j G 
= {y:A{M,s)y<0} (3.2.1) 

for some matrix A{M, s). 

3.2.2 Marginal screening + Lasso 

The marginal screening-|-Lasso procedure was introduced in Fan and Lv (2008) as a variable 

/c 

selection method for the ultra-high dimensional setting of p = 0{e^ ). Fan et al. Fan and 
Lv (2008) recommend applying the marginal screening algorithm with k = n — 1, followed 
by the Lasso on the selected variables. This is a two-stage procedure, so to properly account 
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for the selection we must encode the selection event of marginal screening followed by Lasso. 
This can be done by representing the two stage selection as a single event. Let {Mrn,Sm) 
be the variables and signs selected by marginal screening, and the {Ml, zl) be the variables 
and signs selected by Lasso. In Proposition 2.2 of Lee et al. (2013a), it is shown how to 
encode the Lasso selection event {Ml,zl) as a set of constraints {Aiy < and in 

Section 3.2.1 we showed how to encode the marginal screening selection event {Mm, Sm) as 
a set of constraints {AmV < Thus the selection event of marginal screening+Lasso can 
be encoded as {AlV < hi, AmV < 0}. 

3.2.3 Orthogonal Matching Pursuit 

Orthogonal matching pursuit (OMP) is a commonly used variable selection method At 
each iteration, OMP selects the variable most correlated with the residual r, and then 
recomputes the residual using the residual of least squares using the selected variables. The 
description of the OMP algorithm is given in Algorithm 1. 

Algorithm 1 Orthogonal matching pursuit (OMP) 

1: Input: Design matrix X, response y, and model size k. 

2: for: i = 1 to k 
3: Pi = argmaxj=i^..,^p 

4: Si = {pi}. 

5 : ri+i = {I-X^Xpy. 

6: end for 

7: Output: S := {pi,... ,pk}, and fig = {XTXg)~^XTy 

The OMP selection event as a set of linear constraints on y. 

M{y) = {y : sign(xp.ri)xj.ri > Txjri, for all j / pi and all i € [A:]} 

= {y : Sixl{I - X^^_^X+^Jy > ±xj{I - X^^_^X+^Jy and 
SiXpS^ - X^._^X±_^Jy > 0, for all j ^ Pi, and all i G [k] } 

= |y : A{Mi,.. .,Mk, si, • • ■, 4) < ■ ■ .,Mk,h, ■ ■ • ,4)} • 


^The Lasso selection event is with respect to the Lasso optimization problem after marginal screening. 
^OMP is sometimes known as forward stepwise regression. 
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The selection event encodes that OMP selected a certain variable and the sign of the cor¬ 
relation of that variable with the residual, at steps 1 to fe. The primary difference between 
the OMP selection event and the marginal screening selection event is that the OMP event 
also describes the order at which the variables were chosen. The marginal screening event 
only describes that the variable was among the top k most correlated, and not whether a 
variable was the most correlated or kth most correlated. 

3.2.4 Nonnegative Least Squares 

Non-negative least squares (NNLS) is a simple modification of the linear regression estimator 
with non-negative constraints on /3: 

arg nim ^ ||y-X/3f . (3.2.2) 

Under a positive eigenvalue conditions on X, several authors Slawski et al. (2013); Mein- 
shausen et al. (2013) have shown that NNLS is comprable to the Lasso in terms of prediction 
and estimation errors. The NNLS estimator also does not have any tuning parameters, since 
the sign constraint provides a natural form of regularization. NNLS has found applications 
when modeling non-negative data such as prices, incomes, count data. Non-negativity con¬ 
straints arise naturally in non-negative matrix factorization, signal deconvolution, spectral 
analysis, and network tomography; we refer to Chen and Plemmons (2009) for a compre¬ 
hensive survey of the applications of NNLS. 

We show how our framework can be used to form exact hypothesis tests and confidence 
intervals for NNLS estimated coefficients. The primal dual solution pair (/3, A) is a solution 
iff the KKT conditions are satisfied. 


Aj := —xf {y — XP) > 0 for all i 

/3 > 0 . 

Let M = {i : —xf{y — Xj3) = 0}. By complementary slackness = 0, where —M is 
the complement to the “active” variables M chosen by NNLS. Given the active set we can 


CHAPTER 3. CONDITION-ON-SELECTION METHOD 


43 


solve the KKT equation for the value of /I 

-Xj(y-X/3) = 0 

4 = 4^’ 

which is a linear contrast of y. The NNLS selection event is 

M{y) = {y : X^iy - X$) = 0, X^^{y - XP) > 0} 

= {y:Xl{y- X/3) > 0, -X^iy - XP) > 0, X^^{y - XP) > 0} 

= {y ■■ -^M^py > o> -xpi-XMXpy > 0, x^^{i- x^x±)y > 0} 

= {y ■ A{M)y < 0 }. 

The selection event encodes that for a given y the NNLS optimization program will select 
a subset of variables M{y). 

3.2.5 Logistic regression with Screening 

The focus up to now has been on the linear regression estimator with additive Gaussian 
noise. In this section, we discuss extensions to conditional MLE (maximum likelihood esti¬ 
mator) such as logistic regression. This section is meant to be speculative and non-rigorous; 
our goal is only to illustrate that these tools are not restricted to the linear regression. A 
future publication will rigorously develop the inferential framework for conditional MLE. 
Consider the logistic regression model with loss function and gradient below, 

1 / ” 

m = - -y'^xp + Y, log(l + 

^ V 4 

vm = --X^{y-s{X(3)), 
n 

where s{Xj3) is the sigmoid function applied entrywise. By taylor expansion, the empirical 
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estimator is given by 


/3 Ri /3° - ^ V£{/3°) 

= /3°+(v2£(/3°))“'x^(y-s(X/3)) 

By the Lindeberg CLT (central limit theorem), ^X'^{y—s{X/3^)) —> A/’(0, E(V^£(/3°))), 
and thus w := -^X'^y converges to a Gaussian. The marginal screening selection procedure 
can be expressed as a set of inequalities {sign(t(;j)t(;i > ±Wj,i G M,j G M‘^} = {Aw < 
b}. Thus conditional on the selection, w is approximately a constrained Gaussian. The 
framework in Ghapter 2.4 and 3.1 can be applied to w, instead of y, to derive hypothesis 
tests and confidence intervals for the coefficients of logistic regression. The resulting test 
and confidence intervals should be correct asymptotically. However, this is the best we can 
expect for logistic regression and other conditional MLE because even in the classical case 
the Wald test is only asymptotically correct. For other conditional maximum likelihood 
estimator similar reasoning applies, since the gradient V£(/3) converges in distribution to a 
Gaussian. 

For logistic regression with £i regularizer, ^ (^—y'^Xfl + X]r=i + A ||/3||;^ 

the selection event cannot be analytically described. However, the COS method can still 
be applied using the general method presented in Chapter 3.3. 


3.3 General method for Selective inference 


In this section, we describe a computationally-intensive algorithm for finding selection 
events, when they are not easily described analytically. 

We first review the construction used in Chapter 2 for affine selection events. Let 
Pin^y) = {I — ^T^)y- Recall that y can be decomposed into two independent components 
y = + {I - This is derived by defining y = Yr^l'^y ~ AA(0,/) and 

f] = y can be orthogonally decomposed as y = {'n'^y)T\m A (I ~ so 




y = = {ii'^y)XA + 


= iTv)AN + (i-^)y- 
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Lemma 2.4.1 shows that 

r]'^y\{Ay < b,P^,jy = yo} ~ TN{rfy,a‘^ \\y\\^ ,V~{yo, A,h),V"^{yo, A,b)). 

We can generalize this result to arbitrary selection events, where the selection event is 
not explicitly describable. Recall that H \s a. selection algorithm that maps R” —>• %. The 
selection event is S{H) = {x : H{x) = H}, so y G S{H) iff H{y) = H. In the upcoming 
section, it will be convenient to work with the definition using H{-), since the set S{H) 
cannot be described, but the function H(-) can be efficiently computed. Thus we can only 
verify if a point y G S{H). 

The following Theorem is a straightforward generalization of Theorem 2.4.2 from poly¬ 
hedral sets to arbitrary sets S. 

Theorem 3.3.1 (Arbitrary selection events). Let y be a multivariate truncated normal, so 
L{y) oc e(-i(y - y)'^E-^{y - y))l{y G S{H)). Then 

V^y\{y e S{H),P^.,^y = yo} = TN{if ix,ifEy,U{H,yo, 
and U{H,yo, ^)) = {c : H{yo + c^) = H}. 

Proof. We know that r/'^y|{y G S{H),P^^^y = yo} = TN{t]'^ y.,\\rif ,U{H,yo, so 

v'^y\{y £ 'S'(R), P^rjV ~ Vo} is a univariate normal truncated to some region U. The goal is 
to check that U{H, yo, Sr/) = {c : H{yo + = H}. We can describe the conditioning 

set as 


{y:yG S{H),P^,^y = yo} = {y : H{y) = H, = r/o} 
= {y = yo + :Hiyo + = H, P^^^y = yo} 

= {y = yo + : Pi^r^y = yo,cGU (H, yo, c^^)} 

= {y ■ Pi,vy = yo,y^y e uiH,yo,c^^^^^)} 
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Thus we have that 

v'^y\{y £ S{H), P^,^y = yo} = 

d 


~ TN{ifn,y'^Er],U{H, yo, 

where the second equality follows from independence of rj^y and P^ rjV- O 


y^y\y^y G U{H,yo, = l/o} 

y^y\y^y G U{H,yo, 

77 -' Lr] 


3.3.1 Computational Algorithm for arbitrary selection algorithms 

In this section, we study the case of where the set S{H) cannot be explicitly described, 
but the function H{-) is easily computable. Our goal will be to approximately compute the 
p-value F{y^y;y^^i,U{H,yo, 7 ^!^)), where F is the cdf of TN{y^y,,y^Ey,U{H,yo, :^^)- 
Algorithm 2 is the primary contribution of this section. This allows us to compute 
the pivotal quantity for algorithms H{-) with difficult to describe selection events. This 
includes linear regression with the SCAD/MCP regularizers, and logistic regression with 
£i-regularizer, where the selection events do not have analytical forms. 

Let ^{z',v,a) = be the pdf of a univariate truncated normal with mean u and 

variance cr^. Algorithm 2 gives an approximate p-value for the null hypothesis Hq : rf"yi = 7. 

Algorithm 2 Compute approximate p-value 

Input: Grid points D = {di, ..., dn} and empty set C = 0 
Output: Approximate p-value p 
for all di £ D do 

Compute Hi = H{yo + ^ 77 ^)- 

if Hi = H then 
C = C0di. 

end if 
end for 
Return: 

^ Ecec,c<riTy 0(c; 7, 

^ Ecec7, 


The advantage of this algorithm is it does not need an explicit description of the set S, nor 
the set U. It runs the selection algorithm H{-) at the grid points di, and determines if the 
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point yo + is in the selection event. Then it approximates the CDF of the univariate 

truncated normal by a discrete truncated normal. 

Conjecture 3.3.2. Let Dm be a set of grid points 2m? grid points that is equispaced on 
[—m, m ]. Let U C Li be an open interval, and pm be the p-value from Algorithm 2 using 


Dm- We have 


lim Pm = F{rfy, 7 , U{H, go, 

m^oo T}^ 


))• 


3.4 Inference in the full model 

In Chapter 2, we focused on inference for the submodel coefficients /3^ = X^p,. In selective 
inference, the choice of the model M is selected via an algorithm e.g. the lasso, and the 
COS method constructed confidence intervals 


One possible criticism of the selective confidence intervals for submodel coefficients is 
the interpretability of the quantity I ?*since this is the population regression coefficient 
of variable j within the model M. The signihcance of variable j depends on the choice of 
model meaning variable j can be significant in model Mi, but not significant in M 2 , which 
makes interpretation difficult. 

However, this is not an inherent limitation of the COS method. As we saw in the 
previous two sections, the COS method is not specihc to the submodel coefficients. We 
simply need to change the space of hypothesis H and the selection function H to perform 
inference for other regression coefficients. 

In many scientific applications, the quantity of interest is the regression coefficient within 
the full model M = [l...p]. We hrst discuss the case of n > p. Let us assume that 
y ~ J\f{p,cr'^I). In ordinary least squares , the parameter of interest is /3° = and a 

classical conhdence interval guarantees 

P(/3° G Cj) = l-a. 


In the case of least squares after variable selection, we only want to make a confidence 
interval for the j G M, or variables selected by the lasso. This corresponds to inference for 
a subset = Ej^jl^, where Em selects the coordinates in M. The interpretation of (3^ 
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for j £ M IS clear; this is the regression coefficient of the least squares coefficient restricted 
to the set selected by the lasso. 

For each coefficient j £ M, Equation (2.5.1) provides a valid p-value of the hypothesis 


where r]j = {X'^)'^ej. By inverting, we obtain a selective confidence interval 


(3.4.1) 


P 




G Co 1 =1- a. 


(3.4.2) 


3.4.1 False Discovery Rate 

In this section, we show how to combine selective confidence intervals with the Benjamini- 
Yeuketieli procedure for FDR control. False discovery rate (FDR) is defined as, 


E 


V 

R 


where V is the number of incorrectly rejected hypotheses and R is the total number of 
rejected hypotheses. We will restrict ourselves to the case of the well-specified linear model, 
y = X/3^ -£ e, and n > p with X having full rank. In the context of linear regression, there 
is a sequence of hypotheses Hqj : /I? = 0 and a hypothesis is considered to be incorrectly 
rejected if Hqj is true, yet the variable is selected. 

Given p-values, we can now apply the Benjamini-Yekutieli procedure (Benjamini et ah, 
2001) for FDR control. Let < p( 2 ) < < P{\M\) order statistics, and = 

ES i- Let /c be 


k = max 


k : P(k) < 


\M\h 


-a 


\M\ 


(3.4.3) 


then reject p(i),... ,p(fc). 


Theorem 3.4.1. Consider the procedure that forms p-values using Equation (3.4.1), chooses 
k via Equation (3.4.3), and rejects p(i),... ,P(fc). Then FDR is controlled at level a. 

Proof. Conditioned on the event that variable j is in the lasso active set, j £ M, then pj is 
uniformly distributed among the null variables. Applying the Benjamini-Yekutieli procedure 
to the p-values guarantees FDR. The Benjamini-Yekutieli procedure allows 
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for arbitrary dependence among the p-values, and only requires that the null p-values are 
uniformly distributed □ 

3.4.2 Intervals for coefficients in full model when n < p 

In this section, we present a method for selective inference for coordinates of the full-model 
parameter /3°. We will assume the sparse linear model, namely, 


y = Xp^ + e 


where e ~ AA(0, and f3^ is s-sparse. Since n < p,we cannot use the method in the previous 
section since 7 ^ X~^X/3^. Instead, we will construct a quantity that is extremely close 
to and show that /3j = pJ{Xj5^) -\- hj. We do this by constructing a population version 
of the debiased estimator. 

The debiased estimator presented in Javanmard and Montanari (2013); van de Geer 
et al. (2013); Zhang and Zhang (2014) is 


n 

= + (I - 0S)/3 

n 

= -QX^y + {I -et) 
n 


hXNlv - 
0 


where := ^X'^Xj^ and 0 is an approximate inverse covariance that is the solution to 


min ^ ej S 0 j 
3 



/ 

te-i 

<cJ 

00 V 


logp 

n 


subject to 
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Define the population quantity by replacing all occurrences of y with y: 

s) := -QX^y + (/ - 0S) ~ ® 

n y 0 

= ^° ++ (/- 0S) Pm ^3 4 4^ 

” L -P-M \ 

= + {I- 0S)FmSm'^m) P-KI- Qt)FMt-^s 

:= By -\- /i, 

where Fm is the matrix such that it takes an |M| vector and pads with 0 to make a p vector. 

By choosing y as a row of B, COS framework provides a selective test and confidence 
interval, 
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Proof. Starting from Equation (3.4.4), we have 


= (i-et) 


- /30 

n. A/f' A/f “ 


n M M 


M ' M 


-/?' 


I3<^-13^ 


< 


/-0S 


-M 




-/ 3 ' 


-M 


< ce 


< Ce 


< Ce 


logp 


n 


logp 


n 


logp 


n 


- XENs - 13 

fl M ^ 


M 


- X±-.}s - /3°> 

fl M M 


+ 


+ 


13' 


-M 


(3-13' 


-Efjxjy^i - XE-fs - /3\, 

M MC" m 


«0 


+ CLS 


logp 


n 


where we used the lasso consistency assumption, 
to last inequality uses the fact that I3_j^ = 0, so 

-/3° - " 


QE-I 

/3-/3' 


— second 




+ 


-/ 3 ° - 


> 
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, where s = |M|,and > fi{s) > fi{cMs). 

Plugging this into the expression for — /3°||, 


< C0 



n \\^{T,cms) 


Cm + Cl \ s 



+ CLS 





CM + ‘^ClCq 


s logp 


(3.4.5) 

(3.4.6) 


□ 


Lemma 3.4.3 (Assumptions hold under random Gaussian design with additive Gaussian 
noise). Assume that the rows of X ^ AA(0, S) and lim = oo. Then the estimation 
consistency property, existence of a good approximation 0, empirical sparsity s < cms, and 
p,{E,CMs) > |/u(S,cms) with probability tending to 1. 
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Proof. The estimation consistency property follows from Negahban et al. (2012). The bound 


S0-/ 


< 


is established in Javanmard and Montanari (2013). The empirical 
sparsity result s < cms is from Belloni et al. (2011, 2013). 

The condition on concentration of sparse eigenvalues can be derived using Loh and 
Wainwright (2012, Lemma 15, Supplementary Materials). Lemma 15 states if X is a zero- 
mean sub-Gaussian matrix with covariance S and subgaussian parameter cr^, then there is 
a universal constant c > 0 such that 


1 


sup |—Xu — Sn| > t < 2exp I —cnmin(^, — 2 ) + slogp ) . (3.4.7) 
||i;||o<s,||r|| 2 =l ^ y V CT CJ 


t 


With high probability and for all v G K{s) := {u : ||u||g < s, ||u ||2 = 1}, 


{v'^'Ev — v'^Ev\ < t 

v'^Ev — t < v^Ev 
min w Ew — t < v Ev 

wGK{s) 

jj.{E, s) — t < v^Ev 
u(E,s)—t< min v'^Ev 

v£K{s) 

fiiE, s) — t < fj,{E, s). 


We now use this to show ^{E,Cms) > ^ij,{E,cms). Let t = ^^{E,cms), then by the 
previous argument and Equation (3.4.7), 

fj,{E,CMs) > ^ii{E,cms) 


with probability at least 


1 — 2 exp ( —cn min( 


,Ix{E,cmsY piE, cms). 


4a^ 


2^2 


-I- CM slog p 


For n > - MS ogp -^ have with probability at least, 

-jyi-) 


1 — 2 exp 


M r 

-cn mini 

V 2 ^ 


Ij,{E,cms)‘^ 
' 4a^ 


p{E,cms) f\ 

^^a 2 '’) ■ 
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□ 


Corollary 3.4.4. 


Under the assumptions of Lemma 3.4-3, and lim ^ = 00 , 


/3'^iM,s)-/3^ 



for any <5 > 0. 

Proof. Lemma 3.4.3 ensures that /r(S, s) > ^/r(S, s), and plugging this into Equation (3.4.6) 
gives 




< 


00 ■ \p{E,cms) 


2ce , o \ 

CM + ^ClCq 


n 


Since n ^ s^ log^ p, we have 




< 


2c0 , o ^ ■slogp 

-Cm + 2cLCe 


00 \ij.(Yj,cms) ) n 

( Nm + 2cLCe^ o(^) 

\Pl{L,cms) ) y/n 

5 


< 


< 


n 


□ 


Corollary 3.4.5. Let Cj be a selective confidence interval for (3^ meaning P(/3^ G Cj) = 
1 — a, then 

liminf P(/3)* G Cj ± —j=) >1 — 0 . 

y/n 

Proof. With probability at least 1 — a, /3j G Cj and with probability tending to 1 — o(l), 
l^j ~ l^j < Thus with probability at least 1 — a — o(l), fij € Cj ± □ 

Figure 3.1 shows the results of a simulation study. It makes clear that the intervals 
of Javanmard and Montanari (2013) and our selective conhdence intervals cover which 
is close to /3^. The Javanmard-Montanari intervals are the high-dimensional analog of a 
z-interval, so they are not selectively valid, unlike the selective intervals in blue. 
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Figure 3.1: Confidence intervals for the coefficients in a design with n = 25, p = 50, and 5 
non-zero coefficients. Only the first 20 coefficients are shown. The dotted line represents 
the true signal, and the points represent the (biased) post-selection target. The colored 
bars denote the intervals. 


3.5 Selective Inference for the Knockoff Filter 


In this section, we show how to make selectively valid confidence intervals for the knockoff 
method Foygel Barber and Candes (2014). Let X be the knockoff design matrix, so the 
knockoff regression is done on y = \X]X](3 + e. The introduction of the knockoff variables, 
X, allows us to estimate the FDP as the number of knockoff variables selected divided by 
the number of true variables selected: 


FDP{M) 


|Mnx| 
|Mnx| V 1 


(3.5.1) 


Given a sequence of models M(l ),... ,M{k) be a sequence of nested models M{k) C 
M(A: — 1) C ... C M(l) C [1 .. 2p]. For the lasso, where the M{j) correspond to the lasso 
active set at Xj, the models are not necessarily nested. We define M(j) = where 

Aj is the active set of lasso at Xj. We have an estimate FDP estimate for each model, 
FDP{M{j)) = ’ where X and X represent the indices of the real and knockoff 

variables respectively. This suggests selecting the largest model such that the FDP estimate 
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is less than a, 


T = min{t : FDP{M{t)) < a}. (3.5.2) 

To show this controls modified FDR, we need to construct VF-statistics such that our stop¬ 
ping rule, corresponds to the stopping rule of Foygel Barber and Candes (2014). 

Theorem 3.5.1. The model selected by the stopping rule in Equation (3.5.2) controls the 
modified FDR, that is 

r |M(T)nR| 1 
[|M(r)nx|+ i/aj - 

where V = {I < j < p '■ fij = 0}. 

Theorem 3.5.2. We construct some W statistics. Define tj = min{t : xj G Tf(t)} and 
tj = min{t : xj G M{t)}. Define 


W,= 



if tj < tj 
if ij < tj 


We now verify that the FDP estimate given in Foygel Barber and Candes (2014) using 
the W-statistics are the same FDP estimate as (3.5.1). 


FDPwit) = 


\{j ■■ W, < -t}| 
\{j : W, > t} V 1 
{J '■ ij >t,j ^ X} 

\{j : tj > t,j G X}| V 1 
^ {j : M(t) n X} 
|{j:M(t)nX}| VI- 


By invoking the main theorem of Foygel Barber and Candes (2014), we see that (3.5.2) 
controls the modified FDR. 


By using the FDP^ estimate in place of equation (3.5.2), 


FDP+{M) 


\Mnx\ 

|Mnx| V 1 -I-1 


T+ = min{t : FDP+{M{t)) < a}. 


(3.5.3) 

(3.5.4) 
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we can control FDR, instead of modified FDR. 

Theorem 3.5.3. The model selected by the stopping rule in Equation (3.5.4) controls FDR, 
that is 

r |M(T)nF| 1 
[|M(r)nx| vij - 

where V = {1 < j < p : /3j = 0}. 

Proof. Same as the previous theorem. □ 

Let M* = KO{y) be the final model returned by the knockoff procedure k applied to 
the regression pair {y,X) using the lasso models at the sequence Ai,..., A^. Our goal is 
to do inference for /3? = ejX~^y for some j G M*. The selection event, the set of y’s that 
lead us to testing /3^, is Sj = {y : j G KO{y)'\. This precise set is difficult to analytically 
describe, so we resort to Algorithm 2. 

We can analytically describe the finer event 


S = {y. {L{y, Ai),..., L{y, Ar+i)) = (M(l), ...,M{T + 1))} C Sj, 

where L{y, A) is the active set of lasso at A. For any y G S, the knockoff procedure defined 
by the stopping rule (3.5.2) returns the same set of variables, so S C Sj. The set S is 
described by the intersection of the union of linear inequalities given in Section 2.7. This 
allows us to do inference using the results of Theorem 2.4.3. 

We next describe a method using the general method of Chapter 3.3. Using the COS 
method, we first describe the knockoff selection event. The selection event for variable j is 
Sj = {y ■ j G KO{y)}. The general method instead uses the one-dimensional finer selection 
event Uj = {c : j G KO{P^j^y This set is approximated using Algorithm 2 that 

computes an approximation to Uj and an approximate p-value. 

Since the knockoff method assumes a well-specified linear model, we can use the reference 
distribution y ~ Af{X/3,a‘^I) instead of y ~ Af{y,a‘^I). This is the well-specified linear 
regression model of Fithian et al. (2014). The selection event is now Uj = {C : j G 
KO{Pji .y C),C G span(A_j)■*■}. A multi-dimensional analog of Algorithm 2 can now 
be applied, but the search set D is now over a, n — p + 1 dimensional subset. 
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Chapter 4 

Learning Mixed Graphical Models 

4.1 Introduction 

Many authors have considered the problem of learning the edge structure and parameters 
of sparse undirected graphical models. We will focus on using the li regularizer to promote 
sparsity. This line of work has taken two separate paths: one for learning continuous valued 
data and one for learning discrete valued data. However, typical data sources contain both 
continuous and discrete variables: population survey data, genomics data, url-click pairs etc. 
For genomics data, in addition to the gene expression values, we have attributes attached 
to each sample such as gender, age, ethniticy etc. In this work, we consider learning mixed 
models with both continuous Gaussian variables and discrete categorical variables. 

For only continuous variables, previous work assumes a multivariate Gaussian (Gaus¬ 
sian graphical) model with mean 0 and inverse covariance 0. 0 is then estimated via the 
graphical lasso by minimizing the regularized negative log-likelihood £(0) -|- A ||0||]^. Sev¬ 
eral efficient methods for solving this can be found in Friedman et al. (2008a); Banerjee 
et al. (2008). Because the graphical lasso problem is computationally challenging, several 
authors considered methods related to the pseudolikelihood (PL) and nodewise regression 
(Meinshausen and Biihlmann, 2006; Friedman et ah, 2010a; Peng et ah, 2009). For dis¬ 
crete models, previous work focuses on estimating a pairwise Markov random held of the 
form p{y) oc ^rjiUrjUj), where iprj are pairwise potentials. The maximum likeli¬ 

hood problem is intractable for models with a moderate to large number of variables (high¬ 
dimensional) because it requires evaluating the partition function and its derivatives. Again 
previous work has focused on the pseudolikelihood approach (Guo et ah, 2010; Schmidt, 
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2010; Schmidt et ah, 2008; Hofling and Tibshirani, 2009; Jalali et ah, 2011; Lee et ah, 2006; 
Ravikumar et ah, 2010). 

Our main contribution here is to propose a model that connects the discrete and con¬ 
tinuous models previously discussed. The conditional distributions of this model are two 
widely adopted and well understood models: multiclass logistic regression and Gaussian 
linear regression. In addition, in the case of only discrete variables, our model is a pairwise 
Markov random field; in the case of only continuous variables, it is a Gaussian graphical 
model. Our proposed model leads to a natural scheme for structure learning that general¬ 
izes the graphical Lasso. Here the parameters occur as singletons, vectors or blocks, which 
we penalize using group-lasso norms, in a way that respects the symmetry in the model. 
Since each parameter block is of different size, we also derive a calibrated weighting scheme 
to penalize each edge fairly. We also discuss a conditional model (conditional random field) 
that allows the output variables to be mixed, which can be viewed as a multivariate re¬ 
sponse regression with mixed output variables. Similar ideas have been used to learn the 
covariance structure in multivariate response regression with continuous output variables 
Witten and Tibshirani (2009); Kim et al. (2009); Rothman et al. (2010). 

In Section 4.2, we introduce our new mixed graphical model and discuss previous ap¬ 
proaches to modeling mixed data. Section 4.3 discusses the pseudolikelihood approach to 
parameter estimation and connections to generalized linear models. Section 4.4 discusses 
a natural method to perform structure learning in the mixed model. Section 4.5 presents 
the calibrated regularization scheme. Section 4.6 discusses the consistency of the estimation 
procedures, and Section 4.7 discusses two methods for solving the optimization problem. 
Finally, Section 4.8 discusses a conditional random field extension and Section 4.9 presents 
empirical results on a census population survey dataset and synthetic experiments. 

4.2 Mixed Graphical Model 

We propose a pairwise graphical model on continuous and discrete variables. The model is 
a pairwise Markov random field with density p(x, y; 0) proportional to 

( P P ^ P p q q g \ 

'^'^-^PstXsXt + '^asXs + '^'^Psj{yj)xs + '^'^4>rjiyr,yj) ■ (4.2.1) 

s=l t=l s=l j=l j=l r=l J 
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Here Xg denotes the sth of p continuous variables, and yj the jth of q discrete variables. 
The joint model is parametrized by 0 = [{Pst}, {ots} , {Psj}, {4>rj}]- The discrete yr takes 
on Lr states. The model parameters are Pst continuous-continuous edge potential, as con¬ 
tinuous node potential, PsjiVj) continuous-discrete edge potential, and prjiyr^yj) discrete- 
discrete edge potential. Psj{yj) is a function taking Lj values /9 sj(l),... ,psj{Lj). Similarly, 
4>rj{yr,yj) is a bivariate function taking on Lr x Lj values. Later, we will think of psj{yj) 
as a vector of length Lj and prjiyr, yj) as a matrix of size Lr x Lj. 

The two most important features of this model are: 

1. the conditional distributions are given by Gaussian linear regression and multiclass 
logistic regressions; 

2. the model simplihes to a multivariate Gaussian in the case of only continuous variables 
and simplifies to the usual discrete pairwise Markov random held in the case of only 
discrete variables. 

The conditional distributions of a graphical model are of critical importance. The absence 
of an edge corresponds to two variables being conditionally independent. The conditional 
independence can be read off from the conditional distribution of a variable on all others. 
For example in the multivariate Gaussian model, Xg is conditionally independent of xt iff the 
partial correlation coefficient is 0. The partial correlation coefficient is also the regression 
coefficient of xt in the linear regression of Xg on all other variables. Thus the conditional 
independence structure is captured by the conditional distributions via the regression co¬ 
efficient of a variable on all others. Our mixed model has the desirable property that the 
two type of conditional distributions are simple Gaussian linear regressions and multiclass 
logistic regressions. This follows from the pairwise property in the joint distribution. In 
more detail: 


1. The conditional distribution of yr given the rest is multinomial, with probabilities 
defined by a multiclass logistic regression where the covariates are the other variables 
Xs and y\r (denoted collectively by z in the right-hand side): 


P{yr = k\y\r,x]Q) 


exp (oj^z) 

Ezti exp {ujJz) exp (utoi + Ej 


(4.2.2) 


Here we use a simplified notation, which we make explicit in Section 4.3.1. The discrete 
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variables are represented as dummy variables for each state, e.g. zj = l[7/„ = fc], and 
for continuous variables Zg = Xg. 

2. The conditional distribution of Xg given the rest is Gaussian, with a mean function 
defined by a linear regression with predictors x\g and yr- 


E(xglx\g,yr;0) = Oj'^z = CUq + ZjCJj 


(4.2.3) 


p(Xglx\g,2/r;0) = 


\f^Og 


1 


exp [-^(Xs 


As before, the discrete variables are represented as dummy variables for each state 
Zj = l[yu = k] and for continuous variables Zg = Xg. 

The exact form of the conditional distributions (4.2.2) and (4.2.3) are given in (4.3.5) and 
(4.3.4) in Section 4.3.1, where the regression parameters ojj are defined in terms of the 
parameters 0. 

The second important aspect of the mixed model is the two special cases of only con¬ 
tinuous and only discrete variables. 


1. Continuous variables only. The pairwise mixed model reduces to the familiar multi¬ 
variate Gaussian parametrized by the symmetric positive-dehnite inverse covariance 
matrix B = {/3gt} and mean /r = B~^a, 

p{x) oc exp f “ 2 ^^ ~ B~^a)^B{x — B~^a) 



2. Discrete variables only. The pairwise mixed model reduces to a pairwise discrete 
(second-order interaction) Markov random field, 


p{y) oc exp 


g g 

yy (Vr, Vj) 

j=l r=l 


Although these are the most important aspects, we can characterize the joint distri¬ 
bution further. The conditional distribution of the continuous variables given the discrete 
follow a multivariate Gaussian distribution, p{x\y) = Af{p{y), B~^). Each of these Gaussian 
distributions share the same inverse covariance matrix B but differ in the mean parameter, 
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since all the parameters are pairwise. By standard multivariate Gaussian calculations, 


p(x\y)=J\f(B 


(4.2.4) 

(4.2.5) 






(4.2.6) 


Thus we see that the continuous variables conditioned on the discrete are multivariate 
Gaussian with common covariance, but with means that depend on the value of the dis¬ 
crete variables. The means depend additively on the values of the discrete variables since 
{ 7 ( 2 /)}^ = Y2j=i PsjiVj)- The marginal p{y) has a known form, so for models with few 
number of discrete variables we can sample efficiently. 

4.2.1 Related work on mixed graphical models 

Lauritzen (1996) proposed a type of mixed graphical model, with the property that con¬ 
ditioned on discrete variables, p{x\y) = Af{p{y),E{y)). The homogeneous mixed graphical 
model enforces common covariance, E{y) = S. Thus our proposed model is a special case 
of Lauritzen’s mixed model with the following assumptions: common covariance, additive 
mean assumptions and the marginal p{y) factorizes as a pairwise discrete Markov random 
field. With these three assumptions, the full model simplifies to the mixed pairwise model 
presented. Although the full model is more general, the number of parameters scales ex¬ 
ponentially with the number of discrete variables, and the conditional distributions are not 
as convenient. For each state of the discrete variables there is a mean and covariance. 
Consider an example with q binary variables and p continuous variables; the full model 
requires estimates of 2*? mean vectors and covariance matrices in p dimensions. Even if the 
homogeneous constraint is imposed on Lauritzen’s model, there are still 2^ mean vectors for 
the case of binary discrete variables. The full mixed model is very complex and cannot be 
easily estimated from data without some additional assumptions. In comparison, the mixed 
pairwise model has number of parameters 0{{p+q)‘^) and allows for a natural regularization 
scheme which makes it appropriate for high dimensional data. 

An alternative to the regularization approach that we take in this paper, is the limited- 
order correlation hypothesis testing method Tur and Castelo (2012). The authors develop a 
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hypothesis test via likelihood ratios for conditional independence. However, they restrict to 
the case where the discrete variables are marginally independent so the maximum likelihood 
estimates are well-defined for p > n. 

There is a line of work regarding parameter estimation in undirected mixed models that 
are decomposable: any path between two discrete variables cannot contain only continu¬ 
ous variables. These models allow for fast exact maximum likelihood estimation through 
node-wise regressions, but are only applicable when the structure is known and n > p (Ed¬ 
wards, 2000). There is also related work on parameter learning in directed mixed graphical 
models. Since our primary goal is to learn the graph structure, we forgo exact parameter 
estimation and use the pseudolikelihood. Similar to the exact maximum likelihood in de¬ 
composable models, the pseudolikelihood can be interpreted as node-wise regressions that 
enforce symmetry. 

To our knowledge, this work is the first to consider convex optimization procedures for 
learning the edge structure in mixed graphical models. 

4.3 Parameter Estimation: Maximum Likelihood and Pseu¬ 
dolikelihood 

Given samples {xi,yi)2^i, we want to find the maximum likelihood estimate of 0. This can 
be done by minimizing the negative log-likelihood of the samples: 

n 

£(0) = - ^ logp(xi, Pi] 0) where 
i=l 

p p ^ p p q 

logp(x,?/;0) = EE --/SstXgXt -h ^ asXs + EE PsjiUpXs 

S = 1 t=l S = 1 S=1 j = l 

Q 3 

+ EE 4>rj{yr,yj) - logZ(0) 

i=l r=l 

The negative log-likelihood is convex, so standard gradient-descent algorithms can be used 
for computing the maximum likelihood estimates. The major obstacle here is Z(Q), which 
involves a high-dimensional integral. Since the pairwise mixed model includes both the 
discrete and continuous models as special cases, maximum likelihood estimation is at least 
as difficult as the two special cases, the first of which is a well-known computationally 


(4.3.1) 


(4.3.2) 
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intractable problem. We defer the discussion of maximum likelihood estimation to the 
supplementary material. 


4.3.1 Pseudolikelihood 

The pseudolikelihood method Besag (1975) is a computationally efficient and consistent 
estimator formed by products of all the conditional distributions: 

p <? 

i{Q\x, y) = ^ogp{xs\x\s,y\y\A ®) (4.3.3) 

s=l r=l 

The conditional distributions p{xs\x\s,y]9) and p{yr = k\y\r,jX'i^) on the familiar 
form of linear Gaussian and (multiclass) logistic regression, as we pointed out in (4.2.2) and 
(4.2.3). Here are the details: 


• The conditional distribution of a continuous variable Xg is Gaussian with a linear 
regression model for the mean, and unknown variance. 


P{xs\x\g,y,e) 




/ as + Ej PsjiVj) - ^stxt 
V f^ss 



(4.3.4) 


• The conditional distribution of a discrete variable yr with Lr states is a multinomial 
distribution, as used in (multiclass) logistic regression. Whenever a discrete variable 
is a predictor, each of its levels contribute an additive effect; continuous variables 
contribute linear effects. 


P{yr\y\r,,X]e) 


exp (Es Psriyr)Xs + (t>rr{yr,yr) + 4>rjiyr,yj)^ 

exp (Es Psr{l)Xs + (t)rr{l, 1) + Ej^r ^rj{l,yj)) 


(4.3.5) 


Taking the negative log of both gives us 


-logpixs\x\g,y,e) = -^logPss + ^ f ^ + ^ 


Psj (%■ 


Pst 


ss ■ Hss , / Hss 

3 t^s 


- logp(yr|y\r,, 0) = - log 


exp (Es Psr{yr)Xs + ^rriPr, Vr) + T.j^r Prjiyr, i/j)) 
Ef=l exp (Es Psr{l)Xs + 4>rr{l, 1) + Ej^r Prj{l, %)) 


(4.3.6) 

(4.3.7) 
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A generic parameter block, Buy, corresponding to an edge (u,v) appears twice in the pseu¬ 
dolikelihood, once for each of the conditional distributions p{zu\zy) and p{zy\zu)- 

Proposition 4.3.1. The negative log pseudolikelihood in (4.3.3) is jointly convex in all the 
parameters {/Iss, /3st,as,4'rj, Psj} over the region fiss > 0. 

We prove Proposition 4.3.1 in the Supplementary Materials. 

4.3.2 Separate node-wise regression 

A simple approach to parameter estimation is via separate node-wise regressions; a general¬ 
ized linear model is used to estimate p{zs\z\s) for each s. Separate regressions were used in 
Meinshausen and Biihlmann (2006) for the Gaussian graphical model and Ravikumar et al. 
(2010) for the Ising model. The method can be thought of as an asymmetric form of the 
pseudolikelihood since the pseudolikelihood enforces that the parameters are shared across 
the conditionals. Thus the number of parameters estimated in the separate regression is 
approximately double that of the pseudolikelihood, so we expect that the pseudolikelihood 
outperforms at low sample sizes and low regularization regimes. The node-wise regression 
was used as our baseline method since it is straightforward to extend it to the mixed model. 
As we predicted, the pseudolikelihood or joint procedure outperforms separate regressions; 
see top left box of Figures 4.6 and 4.7. Liu and Ihler (2012, 2011) confirm that the separate 
regressions are outperformed by pseudolikelihood in numerous synthetic settings. 

Concurrent work of Yang et al. (2012, 2013) extend the separate node-wise regression 
model from the special cases of Gaussian and categorical regressions to generalized linear 
models, where the univariate conditional distribution of each node p{xs\x\s) is specified by a 
generalized linear model (e.g. Poisson, categorical, Gaussian). By specifying the conditional 
distributions, Besag (1974) show that the joint distribution is also specified. Thus another 
way to justify our mixed model is to define the conditionals of a continuous variable as 
Gaussian linear regression and the conditionals of a categorical variable as multiple logistic 
regression and use the results in Besag (1974) to arrive at the joint distribution in (4.2.1). 
However, the neighborhood selection algorithm in Yang et al. (2012, 2013) is restricted to 
models of the form p{x) oc exp + Ylst^stXsXt + particular, this 

procedure cannot be applied to edge selection in our pairwise mixed model in (4.2.1) or the 
categorical model in (2) with greater than 2 states. Our baseline method of separate regres¬ 
sions is closely related to the neighborhood selection algorithm they proposed; the baseline 
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can be considered as a generalization of Yang et al. (2012, 2013) to allow for more general 
pairwise interactions with the appropriate regularization to select edges. Unfortunately, 
the theoretical results in Yang et al. (2012, 2013) do not apply to the baseline nodewise 
regression method, nor the joint pseudolikelihood. 

4.4 Conditional Independence and Penalty Terms 

In this section, we show how to incorporate edge selection into the maximum likelihood or 
pseudolikelihood procedures. In the graphical representation of probability distributions, 
the absence of an edge e = (u, v) corresponds to a conditional independency statement 
that variables Xu and x-^ are conditionally independent given all other variables (Roller and 
Friedman, 2009). We would like to maximize the likelihood subject to a penalization on 
the number of edges since this results in a sparse graphical model. In the pairwise mixed 
model, there are 3 type of edges 

1. Pst is a scalar that corresponds to an edge from Xg to xt- /3st = 0 implies Xg and 
xt are conditionally independent given all other variables. This parameter is in two 
conditional distributions, corresponding to either Xg or xt is the response variable, 
p{xs\x\g,y;Q) and p{xt\x\t,y;e). 

2. psj is a vector of length Lj. If Pgj{yj) = 0 for all values of yj, then yj and Xg 
are conditionally independent given all other variables. This parameter is in two 
conditional distributions, corresponding to either Xg or yj being the response variable: 
p{xg\x\g, y; 0) and piyj\x, y\j; 0). 

3. 4>rj is a matrix of size x Lj. If (t>rj{yr,yj) = 0 for all values of yr and yj, then 
yr and yj are conditionally independent given all other variables. This parameter is 
in two conditional distributions, corresponding to either yj. or yj being the response 
variable, p{yr\x, y\r] &) and p{yj\x,y\j-,B). 

For conditional independencies that involve discrete variables, the absence of that edge 
requires that the entire matrix (p^j or vector pgj is 0 The form of the pairwise mixed 

^If PsjiVj) = constant, then Xs and yj are also conditionally independent. However, the unpenalized 
term a will absorb the constant, so the estimated Psj{yj) will never be constant for A > 0. 
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Figure 4.1: Symmetric matrix represents the parameters 0 of the model. This example has p = 3, 
<7 = 2, Li = 2 and L 2 = 3. The red square corresponds to the continuous graphical model coefficients 
B and the solid red square is the scalar fist ■ The blue square corresponds to the coefficients psj and the 
solid blue square is a vector of parameters Psji’)- The orange square corresponds to the coefficients 
(prj and the solid orange square is a matrix of parameters (frji’, ■)■ The matrix is symmetric, so each 
parameter block appears in two of the conditional probability regressions. 


model motivates the following regularized optimization problem 


minmize 4(0) = £{0) + A | ^ l[/3st / 0] + ^ l[psj ^ 0] + ^ l[(j)rj ^ 0] 1 . (4.4.1) 

\s<t sj r<j j 

All parameters that correspond to the same edge are grouped in the same indicator function. 
This problem is non-convex, so we replace the lo sparsity and group sparsity penalties with 
the appropriate convex relaxations. For scalars, we use the absolute value (^i norm), for 
vectors we use the I 2 norm, and for matrices we use the Frobenius norm. This choice 
corresponds to the standard relaxation from group Iq to group I 1 /I 2 (group lasso) norm 
(Bach et ah, 2011; Yuan and Lin, 2006), 

( p s-l 

eeia. 

s=l i=l 

4.5 Calibrated regularizers 

In (4.4.2) each of the group penalties are treated as equals, irrespective of the size of the 
group. We suggest a calibration or weighting scheme to balance the load in a more equitable 


s=l jf=l jf=l r=l 


. (4.4.2) 
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way. We introduce weights for each group of parameters and show how to choose the weights 
such that each parameter set is treated equally under pp, the fully-factorized independence 
model ^ 


p s-i p <? 9 i-i 

minimize £{Q) + A lEE EE '^sj \\psj II 2 EE Wrj 

t=l t=l s=l j=l j=l r=l 


II p 


(4.5.1) 


Based on the KKT conditions (Friedman et ah, 2007), the parameter group 9g is non-zero 
if 


de 


de„ 


> XWn 


where and Wg represents one of the parameter groups and its corresponding weight. Now 


can be viewed as a generalized residual, and for different groups these are different 
dimensions—e.g. scalar/vector/matrix. So even under the independence model (when all 
terms should be zero), one might expect some terms ^ to have a better than random 
chance of being non-zero (for example, those of bigger dimensions). Thus for all parameters 
to be on equal footing, we would like to choose the weights w such that 


-'Pf 


dl 


de„ 


= constant x Wg, 


(4.5.2) 


where pp is the fully factorized (independence) model. We will refer to these as the exact 

0 we propose an approxima- 
, so we may use approximate 


weights. These weights do not have a closed form expression, so we propose an approxima- 

ai ^ 


tion to these. It is simpler to compute in closed form Ep^ 
weights 


dd„ 


w„ oc 




PF 


d£ 


36, 


(4.5.3) 


^Under the independence model pf is fully-factorized p{x,y) = ris=iP(*s) Ylr^iPiUr) 
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at 

d4>i2 

F 


dl 

dpii 

? 


di 

dp2i 

? 


d£ 

dpi2 

2 


ae 

dp22 

2 


ai 

apu 


Exact weights Wg (4.5.2) 

0.18 

0.63 


0.19 


0.47 

0.15 

0.53 

Approximate weights Wg (4.5.4) 

0.13 

0.59 

0.18 

0.44 

0.13 

0.62 


Figure 4.2: Row 1 shows the exact weights Wg computed via Equation (4.5.2) using Monte 
Carlo simulation. These are the ideal weights, but they are not available in closed-form. 
Row 2 shows the approximate weights computed using Equation (4.5.4). As we can see, the 
weights are far from uniform, and the approximate weights are close to the exact weights. 


In the supplementary material, we show that the approximate weights (4.5.4) are 


'^st — 




(4.5.4) 


Wrj = 


I'^Pail - Pa) ^9b(l - Qb) 


as is the standard deviation of the continuous variable Xg- Pa = Pr{yr = a) and qb = 

Pr{yj = h) . For all 3 types of parameters, the weight has the form of Wuv = tr(cov( 2 :„))tr(cov(z„)), 
where z represents a generic variable and cov( 2 ) is the variance-covariance matrix of z. 

We conducted a simulation study to show that calibration is needed. Consider a model 
with 4 independent variables: 2 continuous with variance 10 and 1, and 2 discrete variables 
with 10 and 2 levels. 

There are 6 candidate edges in this model and from row 1 of Table 4.2 we can see the 
sizes of the gradients are different. In fact, the ratio of the largest gradient to the smallest 
gradient is greater than 4. The edges pu and pi 2 involving the first continuous variable with 
variance 10 have large edge weights, than the corresponding edges, p 2 i and P 22 involving the 
second continuous variable with variance 1. Similarly, the edges involving the hrst discrete 
variable with 10 levels are larger than the edges involving the second discrete variable with 
2 levels. This reflects our intuition that larger variance and longer vectors will have larger 
norm. 

Had the calibration weights been chosen via Equation 4.5.2, w = {wg}g and the vector 


of gradients Vi = { ^ }g would have cosine similarity, sim{u, v) = 


= 1. The 
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Pii 

Pl2 

4>12 

No Calibration Wg = 1 

0.1350 

0.7280 

0.1370 

Exact Wg (4.5.2) 

0.3180 

0.3310 

0.3510 

Approximate Wg (4.5.4) 

0.2650 

0.2650 

0.4700 


Table 4.1: Frequency an edge is the first selected by the group lasso regularizer. The group 
lasso with equal weights is highly unbalanced, as seen in row 1. The weighing scheme with 
the weights from (4.5.2) is very good, and selects the edges with probability close to the 
ideal The approximate weighing scheme of (4.5.4) is an improvement over not calibrating; 
however, not as good as the weights from (4.5.2). 

approximate weights we used are from Equation (4.5.4) and have cosine similarity 

sim(w, V£) = .993, 

which is extremely close to 1. Thus the calibration weights are effective in accounting for 
the size and variances of each edge group. 

In the second simulation study, we used a model with 3 independent variables: one 
continuous, and 2 discrete variables with 2 and 4 levels. There are 3 candidate edges, 
and we computed the probability that a given edge would be the first allowed to enter the 
model using 3 different calibration schemes. From Table 4.1, we see that the uncalibrated 
regularizer would select the edge between the continuous variable and the 4 level discrete 
variable about 73% of the time. A perfect calibration scheme would select each edge 33% 
of the time. We see that the two proposed calibration schemes are an improvement over 
the uncalibrated regularizer. 

The exact weights do not have a simple closed form expression, but they can be easily 
computed via Monte Carlo. This can be done by simulating independent Gaussians and 
multinomials with the appropriate marginal variance as and marginal probabilities pa, then 
approximating the expectation in (4.5.2) by an average. The computational cost of this 
procedure is negligible compared to fitting the mixed model, so the exact weights can also 
be used. 

4.6 Model Selection Consistency 

In this section, we study the model selection consistency, whether the correct edge set is 
selected and the parameter estimates are close to the truth, of the pseudolikelihood and 
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maximum likelihood estimators. Consistency can be established using the framework hrst 
developed in Ravikumar et al. (2010) and later extended to general M-estimators by Lee 
et al. (2013b). Instead of stating the full results and proofs, we will illustrate the type 
of theorems that can be shown and defer the rigorous statements to the Supplementary 
Material. 

First, we dehne some notation. Recall that 0 is the vector of parameters being es¬ 
timated {Pss, Pst, cus, (prj, Psj}, 0* be the true parameters that estimated the model, and 
Q = Both maximum likelihood and pseudolikelihood estimation procedures can 

be written as a convex optimization problem of the form 


minimize £(0) -|- A E ii®»ii 2 (“'i) 

geG 

where £(9) = {£ml, £pl} is one of the two log-likelihoods. The regularizer 


Eii®9 


P S —1 


<? i-1 




Grj IliT’ 


, s=l t=l 


s=l j=l 


j=l r=l 


The set G indexes the edges Pst, Psj, and prj, and Qg is one of the three types of edges. 
Let A and I represent the active and inactive groups in 0, so 0* 7 ^: 0 for any g ^ A and 
0* = 0 for any g ^ I. 

Let 0 be the minimizer to Equation (4.6.1). Then 0 satishes, 


1 . 


0 - 0 ^ 


< C 


|A| log|G| 
n 


2 . 0g = 0 for g £ I. 


The exact statement of the theorem is given in the Supplementary Material. 


4.7 Optimization Algorithms 

In this section, we discuss two algorithms for solving (4.4.2): the proximal gradient and the 
proximal newton methods. This is a convex optimization problem that decomposes into the 
form f{x) + g{x), where / is smooth and convex and g is convex but possibly non-smooth. 
In our case / is the negative log-likelihood or negative log-pseudolikelihood and g are the 
group sparsity penalties. 
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Block coordinate descent is a frequently used method when the non-smooth function 
g is the li or group li. It is especially easy to apply when the function / is quadratic, 
since each block coordinate update can be solved in closed form for many different non¬ 
smooth g (Friedman et ah, 2007). The smooth / in our particular case is not quadratic, so 
each block update cannot be solved in closed form. However in certain problems (sparse 
inverse covariance), the update can be approximately solved by using an appropriate inner 
optimization routine (Friedman et ah, 2008b). 

4.7.1 Proximal Gradient 

Problems of this form are well-suited for the proximal gradient and accelerated proximal 
gradient algorithms as long as the proximal operator of g can be computed (Combettes and 
Pesquet, 2011; Beck and Teboulle, 2010) 

1 2 

proxt(x) = argmin — ||x — u\\ + g{u) (l-l-l) 

u 

For the sum of h group sparsity penalties considered, the proximal operator takes the 
familiar form of soft-thresholding and group soft-thresholding (Bach et ah, 2011). Since the 
groups are non-overlapping, the proximal operator simplifies to scalar soft-thresholding for 
/3st and group soft-thresholding for psj and (prj- 

The class of proximal gradient and accelerated proximal gradient algorithms is directly 
applicable to our problem. These algorithms work by solving a first-order model at the 
current iterate Xk 

argmin f{xk) + Vf{xkf{u - Xk) + ^ ||tt - Xk\\^ + g{u) (4.7.2) 

U 

= argmin ^ \\u - {xk - tXf{xk))f + giu) (4.7.3) 

= proxt{xk - tVf{xk)) (4.7.4) 

The proximal gradient iteration is given by Xk+i = proxt{xk — tXf{xk)) where t is de¬ 
termined by line search. The theoretical convergence rates and properties of the proximal 
gradient algorithm and its accelerated variants are well-established (Beck and Teboulle, 
2010). The accelerated proximal gradient method achieves linear convergence rate of O(c^) 
when the objective is strongly convex and the sublinear rate 0(1/k'^) for non-strongly convex 
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problems. 

The TFOCS framework (Becker et ah, 2011) is a package that allows us to experiment 
with 6 different variants of the accelerated proximal gradient algorithm. The TFOCS au¬ 
thors found that the Auslender-Teboulle algorithm exhibited less oscillatory behavior, and 
proximal gradient experiments in the next section were done using the Auslender-Teboulle 
implementation in TFOCS. 

4.7.2 Proximal Newton Algorithms 

The class of proximal Newton algorithms is a 2nd order analog of the proximal gradient 
algorithms with a quadratic convergence rate (Lee et ah, 2012; Schmidt, 2010; Schmidt 
et ah, 2011). It attempts to incorporate 2nd order information about the smooth function 
/ into the model function. At each iteration, it minimizes a quadratic model centered at Xk 

argmin f(xk) -h Vf(xk)'^(u - Xk) + ^{u - XkfH{u - Xk) + g{u) (4.7.5) 

= argmin ^ [u - Xk + tH~^Vf{xk))'^ H [u - Xk + tH~^Xf{xk)) + g{u) (4.7.6) 

U 

= argmin ^\\u - {xk - tH~^Xf{xk))\\]j + g{u) (4.7.7) 

:= Hproxt {xk — tH~^Vf{xk)) where H = V‘^f{xk) (4.7.8) 

The Hprox operator is analogous to the proximal operator, but in the H-Hji^-norm. It 

Algorithm 3 Proximal Newton 
repeat 

Solve subproblem pk = Hproxt {xk — tHjl^Vf{xk)) — Xk using TFOCS. 

Find t to satisfy Armijo line search condition with parameter a 

f{xk + tpk) +g{xk + tpk) < f{xk) +g{xk) - Y Wpkf 

Set Xk-\-l — Xk “h tpk 

k = k + l 

until < tol 


simplifies to the proximal operator if 77 = /, but in the general case of positive definite H 
there is no closed-form solution for many common non-smooth g{x) (including li and group 
h). However if the proximal operator of g is available, each of these sub-problems can be 
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solved efficiently with proximal gradient. In the case of separable g, coordinate descent is 
also applicable. Fast methods for solving the subproblem Hproxt{xk—tH~^V f{xk)) include 
coordinate descent methods, proximal gradient methods, or Barzilai-Borwein (Friedman 
et ah, 2007; Combettes and Pesquet, 2011; Beck and Teboulle, 2010; Wright et ah, 2009). 
The proximal Newton framework allows us to bootstrap many previously developed solvers 
to the case of arbitrary loss function /. 

Theoretical analysis in Lee et al. (2012) suggests that proximal Newton methods gen¬ 
erally require fewer outer iterations (evaluations of Hprox) than first-order methods while 
providing higher accuracy because they incorporate 2nd order information. We have con¬ 
firmed empirically that the proximal Newton methods are faster when n is very large or the 
gradient is expensive to compute (e.g. maximum likelihood estimation). Since the objective 
is quadratic, coordinate descent is also applicable to the subproblems. The hessian matrix 
H can be replaced by a quasi-newton approximation such as BFGS/L-BFGS/SRl. In our 
implementation, we use the PNOPT implementation (Lee et ah, 2012). 

4.7.3 Path Algorithm 

Frequently in machine learning and statistics, the regularization parameter A is heavily 
dependent on the dataset. A is generally chosen via cross-validation or holdout set per¬ 
formance, so it is convenient to provide solutions over an interval of [Xmin, ^max]- We 
start the algorithm at Ai = Xmax and solve, using the previous solution as warm start, for 
A 2 > ... > Xmin- We find that this reduces the cost of fitting an entire path of solutions 
(See Figure 4.5). Xmax can be chosen as the smallest value such that all parameters are 0 
by using the KKT equations (Friedman et ah, 2007). 

4.8 Conditional Model 

In addition to the variables we would like to model, there are often additional features or 
covariates that affect the dependence structure of the variables. For example in genomic 
data, in addition to expression values, we have attributes associated to each subject such 
as gender, age and ethnicity. These additional attributes affect the dependence of the 
expression values, so we can build a conditional model that uses the additional attributes 
as features. In this section, we show how to augment the pairwise mixed model with features. 

Gonditional models only model the conditional distribution p{z\f)^ as opposed to the 
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joint distribution p(z, /), where z are the variables of interest to the prediction task and / 
are features. These models are frequently used in practice Lafferty et al. (2001). 

In addition to observing x and y, we observe features / and we build a graphical model 
for the conditional distribution p{x,y\f). Consider a full pairwise model p{x,y,f) of the 
form (4.2.1). We then choose to only model the joint distribution over only the variables x 
and y to give us p{x, y\f) which is of the form 


p{x,y\f-,e) 


1 


p p 


, S=1 t=l 


P Q 


exp (EE -i^PstXsXt + ^ asXs + EE Psj{yj)xs 

S=1 j = l 


s=l 


q j F p F q 

+ EE (j^rj iUrj Uj ) T EE llsXsfl + EE mriyr)fl 

j=l r=l 1=1 s=l 1=1 r=l 


(4.8.1) 


We can also consider a more general model where each pairwise edge potential depends on 
the features 


p p 


p{x,y\f]Q) = 


zm) 


exp EE -F^st{f)XsXt + E“.(/)^- 


\s = l t=l 
p <? 


S=1 


9 J 


+EE Psjiyj: f)Xs + EE 4’rj{yri yji f) 

s=l j=l j=l r=l 


(4.8.2) 


(4.8.1) is a special case of this where only the node potentials depend on features and the 
pairwise potentials are independent of feature values. The specific parametrized form we 
consider is (l)rj{yr,yj, f) = 4’rj{yr,yj) for r / j, psj{yj,f) = Psj{yj), and /3st(/) = /3st- 
The node potentials depend linearly on the feature values, ««(/) = lisXsfu and 

(t>rr{yri yri /) — 4^rriyri Pr) T Xyi VlriPr)- 


4.9 Experimental Results 

We present experimental results on synthetic data, survey data and on a conditional model. 
4.9.1 Synthetic Experiments 

In the synthetic experiment, the training points are sampled from a true model with 10 
continuous variables and 10 binary variables. The edge structure is shown in Figure 4.3a. 
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A is chosen proportional to y as suggested by the theoretical results in Section 

4.6. We experimented with 3 values A = {1, 5, and chose A = so 

that the true edge set was recovered by the algorithm for the sample size n = 2000. We 
see from the experimental results that recovery of the correct edge set undergoes a sharp 
phase transition, as expected. With n = 1000 samples, the pseudolikelihood is recovering 
the correct edge set with probability nearly 1. The maximum likelihood was performed 
using an exact evaluation of the gradient and log-partition. The poor performance of the 
maximum likelihood estimator is explained by the maximum likelihood objective violating 
the irrepresentable condition; a similar example is discussed in (Ravikumar et ah, 2010, 
Section 3.1.1), where the maximum likelihood is not irrepresentable, yet the neighborhood 
selection procedure is. The phase transition experiments were done using the proximal 
Newton algorithm discussed in Section 4.7.2. 

We also run the proximal Newton algorithm for a sequence of instances with p = q = 
10, 50,100, 500,1000 and n = 500. The largest instance has 2000 variables and takes 12.5 
hours to complete. The timing results are summarized in Figure 4.4. 

4.9.2 Survey Experiments 

The census survey dataset we consider consists of 11 variables, of which 2 are continuous and 
9 are discrete: age (continuous), log-wage (continuous), year(7 states), sex(2 states),marital 
status (5 states), race(4 states), education level (5 states), geographic region(9 states), job 
class (2 states), health (2 states), and health insurance (2 states). The dataset was assembled 
by Steve Miller of OpenBI.com from the March 2011 Supplement to Current Population 
Survey data. All the evaluations are done using a holdout test set of size 100,000 for the 
survey experiments. The regularization parameter A is varied over the interval [5 x 10“^, 0.7] 
at 50 points equispaced on log-scale for all experiments. In practice, A can be chosen to 
minimize the holdout log pseudolikelihood. 

Model Selection 

In Figure 4.5, we study the model selection performance of learning a graphical model over 
the 11 variables under different training samples sizes. We see that as the sample size 
increases, the optimal model is increasingly dense, and less regularization is needed. 
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(b) 

Figure 4.3: Figure 4.3a shows the graph used in the synthetic experiments for p = q = 4; the 
experiment actually used p=10 and q=10. Blue nodes are continuous variables, red nodes are binary 
variables and the orange, green and dark blue lines represent the 3 types of edges. Figure 4-3b is a 
plot of the probability of correct edge recovery, meaning every true edge is selected and no non-edge 
is selected, at a given sample size using Maximum Likelihood and Pseudolikelihood. Results are 
averaged over 100 trials. 


p + q 

Time per Iteration (sec) 

Total Time (min) 

Number of Iterations 

20 

.13 

.003 

13 

100 

4.39 

1.32 

18 

200 

18.44 

6.45 

21 

1000 

245.34 

139 

34 

2000 

1025.6 

752 

44 


Figure 4.4: Timing experiments for various instances of the graph in Figure 4.3a. The 
number of variables range from 20 to 2000 with n = 500. 


















CHAPTER 4. LEARNING MIXED GRAPHICAL MODELS 


79 



Figure 4.5: Model selection under different training set sizes. Circle denotes the lowest test set 
negative log pseudolikelihood and the number in parentheses is the number of edges in that model at 
the lowest test negative log pseudolikelihood. The saturated model has 55 edges. 

Comparing against Separate Regressions 

A sensible baseline method to compare against is a separate regression algorithm. This 
algorithm fits a linear Gaussian or (multiclass) logistic regression of each variable condi¬ 
tioned on the rest. We can evaluate the performance of the pseudolikelihood by evaluating 
— logp{xs\x\s^y) for linear regression and — logp{yr\y\r,x) for (multiclass) logistic regres¬ 
sion. Since regression is directly optimizing this loss function, it is expected to do better. 
The pseudolikelihood objective is similar, but has half the number of parameters as the sep¬ 
arate regressions since the coefficients are shared between two of the conditional likelihoods. 
From Figures 4.6 and 4.7, we can see that the pseudolikelihood performs very similarly to 
the separate regressions and sometimes even outperforms regression. The benefit of the 
pseudolikelihood is that we have learned parameters of the joint distribution p{x, y) and 
not just of the conditionals p{xs\y, x\s)- the test dataset, we can compute quantities such 
as conditionals over arbitrary sets of variables p(yy 4 , xs|?/^c, x^c) and marginals p{xA,yB) 
(Koller and Friedman, 2009). This would not be possible using the separate regressions. 

Conditional Model 

Using the conditional model (4.8.1), we model only the 3 variables logwage, education(5) 
and jobclass(2). The other 8 variables are only used as features. The conditional model 
is then trained using the pseudolikelihood. We compare against the generative model that 
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Separate — — — Joint 

Full Model age logwage 






jobclass 


health 


health ins. 




Regularization Parameter Log-Scale 


Figure 4.6: Separate Regression vs Pseudolikelihood n = 100. y-axis is the appropriate regression 
loss for the response variable. For low levels of regularization and at small training sizes, the pseu¬ 
dolikelihood seems to overfit less; this may be due to a global regularization effect from fitting the 
joint distribution as opposed to separate regressions. 
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' Separate 


Joint 
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logwage 
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Figure 4.7: Separate Regression vs Pseudolikelihood n = 10,000. y-axis is the appropriate regres¬ 
sion loss for the response variable. At large sample sizes, separate regressions and pseudolikelihood 
perform very similarly. This is expected since this is nearing the asymptotic regime. 
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Regularization Parameter Log-Scale 


Figure 4.8: Conditional Model vs Generative Model at various sample sizes, y-axis is test set 
performance is evaluated on negative log pseudolikelihood of the conditional model. The conditional 
model outperforms the full generative model at except the smallest sample size n = 100. 


learns a joint distribution on all 11 variables. From Figure 4.8, we see that the conditional 
model outperforms the generative model, except at small sample sizes. This is expected 
since the conditional distribution models less variables. At very small sample sizes and 
small A, the generative model outperforms the conditional model. This is likely because 
generative models converge faster (with less samples) than discriminative models to its 
optimum. 

Maximum Likelihood vs Pseudolikelihood 

The maximum likelihood estimates are computable for very small models such as the con¬ 
ditional model previously studied. The pseudolikelihood was originally motivated as an ap¬ 
proximation to the likelihood that is computationally tractable. We compare the maximum 
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likelihood and maximum pseudolikelihood on two different evaluation criteria: the negative 
log likelihood and negative log pseudolikelihood. In Figure 4.9, we hnd that the pseu- 
dolikelihood outperforms maximum likelihood under both the negative log likelihood and 
negative log pseudolikelihood. We would expect that the pseudolikelihood trained model 
does better on the pseudolikelihood evaluation and maximum likelihood trained model does 
better on the likelihood evaluation. However, we found that the pseudolikelihood trained 
model outperformed the maximum likelihood trained model on both evaluation criteria. 
Although asymptotic theory suggests that maximum likelihood is more efficient than the 
pseudolikelihood, this analysis is inapplicable because of the hnite sample regime and mis- 
specified model. See Liang and Jordan (2008) for asymptotic analysis of pseudolikelihood 
and maximum likelihood under a well-specified model. We also observed the pseudolikeli- 
hood slightly outperforming the maximum likelihood in the synthetic experiment of Figure 
4.3b. 

4.10 Conclusion 

This work proposes a new pairwise mixed graphical model, which combines the Gaussian 
graphical model and discrete graphical model. Due to the introduction of discrete variables, 
the maximum likelihood estimator is computationally intractable, so we investigated the 
pseudolikelihood estimator. To learn the structure of this model, we use the appropriate 
group sparsity penalties with a calibrated weighing scheme. Model selection consistency 
results are shown for the mixed model using the maximum likelihood and pseudolikelihood 
estimators. The extension to a conditional model is discussed, since these are frequently 
used in practice. 

We proposed two efficient algorithms for the purpose of estimating the parameters of 
this model, the proximal Newton and the proximal gradient algorithms. The proximal 
Newton algorithm is shown to scale to graphical models with 2000 variables on a standard 
desktop. The model is evaluated on synthetic and the current population survey data, 
which demonstrates the pseudolikelihood performs well compared to maximum likelihood 
and nodewise regression. 

For future work, it would be interesting to incorporate other discrete variables such as 
poisson or binomial variables and non-Gaussian continuous variables. This would broaden 
the scope of applications that mixed models could be used for. Our work is a hrst step in 


Negative Log Likelihood Negative Log PL 


CHAPTER 4. LEARNING MIXED GRAPHICAL MODELS 


84 


ML training — — — PL training 


n=100 


n=500 


n=1000 





n=100 


n=500 


n=1000 





Regularization Parameter Log-Soale 


Figure 4.9: Maximum Likelihood vs Pseudolikelihood, y-axis for top row is the negative log pseu¬ 
dolikelihood. y-axis for bottom row is the negative log likelihood. Pseudolikelihood outperforms max¬ 
imum likelihood across all the experiments. 
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that direction. 
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Supplementary Materials 

4.10.1 Proof of Convexity 

Proposition 4.3.1. The negative log pseudolikelihood in (4.3.3) is jointly convex in all the 
parameters {fiss, (dst,Cis,(t>rj, Psj] over the region /3ss > 0. 


Proof. To verify the convexity of £(0|x, y), it suffices to check that each term is convex. 

— logp{yr\y\r^,x] 0) is jointly convex in p and (j) since it is a multiclass logistic regression. 
We now check that — logp(xs|x\ 5 , y; 0) is convex. —^log/3ss is a convex function. To 
establish that 

PsjiVj) f^st 



Ps 


t^S 


Ps 


-Xt - Xs 


is convex, we use the fact that f{u,v) = 5 (y — c)^ is convex. Let v = Pss, u = as + 

Ylj PsjiVj) ~ Ylt^s PstXt, and c = Xg- Notice that Xg, ag, yj, and xt are fixed quantities and 

u is affinely related to Pgt and psj. A convex function composed with an affine map is still 

2 


convex, thus 




psj{Vj ) 


-E 


Bst 

" ^ rp _ rp 


IS convex. 


To finish the proof, we verify that f{u, v) = — c)^ = IPlAAD convex over u > 0. 


2^v ' 2 V 

The epigraph of a convex function is a convex set iff the function is convex. Thus we establish 
that the set C = {(rt, v,t)\ \ > 0} is convex. Let A = 


V u — cv 
u — cv t 


. The 


/ \2 

Schur complement criterion of positive definiteness says A 0 iff u > 0 and t > LL_£LL_ 
The condition A 0 is a linear matrix inequality and thus convex in the entries of A. The 
entries of A are linearly related to u and u, so A 0 is also convex in u and v. Therefore 
u > 0 and t > 


is a convex set. 


□ 


4.10.2 Sampling Prom The Joint Distribution 

In this section we discuss how to draw samples (x,y) ~ p{x,y). Using the property that 
p{x, y) = p{y)p{x\y), we see that if y ~ p{y) and x ~ p{x\y) then (x, y) ~ p(x, y). We have 
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that 



(4.10.1) 


ip(y))s = '^psj{yj) 


(4.10.2) 


j 


p{x\y) = No{B ^{a + p{y)),B ^) 


(4.10.3) 


The difficult part is to sample y ~ p{y) since this involves the partition function of the 
discrete MRF. This can be done with MCMC for larger models and junction tree algorithm 
or exact sampling for small models. 

4.10.3 Maximum Likelihood 


The difficulty in MLE is that in each gradient step we have to compute T{x, y)—Epi^Q'^ y)] 
the difference between the empirical sufficient statistic T{x,y) and the expected sufficient 


statistic. In both continuous and discrete graphical models the computationally expensive 
step is evaluating [T{x, y)]. In discrete problems, this involves a sum over the discrete 

state space and in continuous problem, this requires matrix inversion. For both discrete 
and continuous models, there has been much work on addressing these difficulties. For dis¬ 
crete models, the junction tree algorithm is an exact method for evaluating marginals and 
is suitable for models with low tree width. Variational methods such as belief propagation 
and tree reweighted belief propagation work by optimizing a surrogate likelihood function 
by approximating the partition function Z{Q) by a tractable surrogate Z(Q) Wainwright 
and Jordan (2008). In the case of a large discrete state space, these methods can be used 
to approximate p{y) and do approximate maximum likelihood estimation for the discrete 
model. Approximate maximum likelihood estimation can also be done via Monte Carlo esti¬ 
mates of the gradients T{x, y) — Ep(^Q-^(T{x, y)). For continuous Gaussian graphical models, 
efficient algorithms based on block coordinate descent Friedman et al. (2008b); Banerjee 
et al. (2008) have been developed, that do not require matrix inversion. 
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The joint distribution and loglikelihood are: 

p(x,y;0) = exp(-^x^Bx + (a + p(y))'^x + '^(f>rj(yr,yj))/Z(e) 

(rj) 

^(0) = (\x^Bx - (a + p{y))^X - ^ (prj{yr, 2/i)^ 


dxexp {-^-x'^Bx + {a + p{y'))'^x) exp{^ y'))) 


The derivative is 


di I 

m = + 


/ ‘‘M'Ly exp{-p^Bx + (a + p{y)px + ipriii'rxy'j))) 


= -xx^+ / '^{--xx'^p{x,y'-,Q)) 
y' 

= -xx^ + ^ / --xx'^p{x\y']e)p{y') 
y' 

= \xx^ + J + B~^{a + p{y )){a + p{y'Y')B-^) p{y) 
y' 

The primary cost is to compute B~^ and the sum over the discrete states y. 

The computation for the derivatives of i{Q) with respect to psj and (prj are similar. 


(®) b') 


= - 1 (,, = a,y, = 6 ) + = a,y', = k)p(x.y';B) 

y 

= -l(yr = a, yj = 6) + ^ l(y^ = a, y' = b)p{y') 

y' 


The gradient requires summing over all discrete states. 
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Similarly for Psj(a)- 


di _ 
= -l(2/i 


l{yj = a)xs + X] / = a)xs)pix', y'; 0) 


MLE estimation requires summing over the discrete states to compute the expected sufficient 
statistics. This may be approximated using using samples {x,y) ~ p{x,y;Q). The method 
in the previous section shows that sampling is efficient if y ~ p{y) is efficient. This allows 
us to use MCMC methods developed for discrete MRF’s such as Gibbs sampling. 


4.10.4 Choosing the Weights 

We first show how to compute Wgj- The gradient of the pseudo-likelihood with respect to 
a parameter Psj{a) is given below 


di 

dpsj{a) 


^-2 X l[y] = a]x\ + Epp{l[yj = a]xs\y\j,x'') + Epp{l[yj = a\xs\xy^,y") 

i=l 

n 

^-2x1 [y] = a\x\ + x\p{yj = a) -M [y] = a] ps 
i=l 
n 

^ \y) = =«) -1 [y] =«]) 

i=l 

n 

[yj = “ p(yj = «)) (A^ - xl) + (x* - fis) {p{yj = a) - 1 [y* = a] ) 

i=l 

(4.10.4) 

n 

^ 2 (l [y] = a] - p{yj = a)) {ps - x*) 
i=l 


(4.10.5) 
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Since the subgradient condition includes a variable if > A, we compute E ^ 

By independence, 


= 4nEpp (||l[y* = a] - p{yj = a)\f'^ Ep^ (^||/1, - x* (4.10.7) 

= 4(n - l)p{yj = a){l-p{yj = a))cj^ (4.10.8) 


{l[yj = a] - p{yj = a)) {ps - x*) 


(4.10.6) 


The last line is an equality if we replace the sample means p and p with the true values p and 

2 

p. Thus for the entire vector psj we have Epp ^0- = 4(n—1) (Y^aPiVi — ~ PiVj — ®)) ^s- 

If we let the vector 2 : be the indicator vector of the categorical variable yj , and let the vector 

2 

P = PiVj = a), then Epp 0- = 4{n - 1) = 4(n - l)tr(cov( 2 ;))var(x) 

and Wsj = VT^aPaC^ - Pa)(^s- 

We repeat the computation for /Igt- 

n 

^ -2x*xt + Epp{xlx\\x\s, y) + Epp{xlxl\x\t, y) 

i=\ 
n 

-2x*xJ + psxl + ptxl 

i=l 
n 

Y ~ 

i=\ 
n 

Y^^'t ~ - fis){pt - x\) 

i=\ 
n 

Y‘^ixl - Pt){Ps - xl) 

i=l 

Thus 

^ ^ 2(xJ - pt)iPs - xl) 

= 4nEpp \\xt - p0 Epp ||x<i - ps 
= 4(n - l)a^a^ 



di 

dPst 
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= 4{n — l)cr^crf and taking square-roots gives us Wgt = CTsf^t- 
We repeat the same computation for (prj. Let pa = Pr{yr = a) and qi, = Pr{yj = b). 

di 

Q^^.(^a,b) (Ibr- = a]l[yj = b]\y\r,x) 

+ E {l[yr = a]l[yj = b]\y\j,x) 

n 

= ^ -1 [yl = a]l [yj = b] + pal [y} = b] + qbl [yl = a] 

i=l 

n 

= [yj = 1 [^r = «] ) + 1 [^r = «] {% - 1 [Vj = b] ) 

i=l 

n 

= \y) = ^] “ ^b){Pa - 1 [y* = a]) -h (1 [y]. = a] - Pa){qb - 1 [y] = b]) 

i=l 

n 

= iyj = -1 [i/r = “]) 

i=l 


Thus E; 


PF 


de 


dl3s 


Thus we compute 


^PF 


di 


d(f)rj{a, b) 


= E 


X] \yj = - 1 br = «] ) 


2 = 1 


= AnEpp 11% - l[yj = b]f Ep^ \\pa - l[yr = a]f 
= 4(n - 1)®(1 - qb)Pa{I - Pa) 


From this, we see that E, 


PF 


dl 


d4>r 




n 


w. 


rj 


= \/Ea=l Ebil Qbil - qb)Pa{l - Pa)- 


l)qb{l - qb)Pa{I -Pa) and 


4.10.5 Model Selection Consistency 

One of the difficulties in establishing consistency results for the problem in Equation (4.6.1) 
is due to the non-identifiability of the parameters. £(0) is constant with respect to the 
change of variables Pgjiyj) = PsjiVj) + c and similarly for cj), so we cannot hope to recover 
0*. A popular fix for this issue is to drop the last level of p and (/>, so they are only indicators 
over L — 1 levels instead of L levels. This allows for the model to be identifiable, but it 
results in an asymmetric formulation that treats the last level differently from other levels. 
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Instead, we will maintain the symmetric formulation by introducing constraints. Consider 
the problem 

minimize t'(0) + A Elie»ll 2 

9 SG (4,10.9) 

subject to CO = 0. 


The matrix C constrains the optimization variables such that 


= 0 
Vj 

^ ^ 4‘rj{yrj yj) — 0 . 

Vj 

The group regularizer implicitly enforces the same set of constraints, so the optimization 
problems of Equation (4.10.9) and Equation (4.6.1) have the same solutions. For our the¬ 
oretical results, we will use the constrained formulation of Equation (4.10.9), since it is 
identifiable. 

We first state some definitions and two assumptions from Lee et al. (2013b) that are 
necessary to present the model selection consistency results. Let A and I represent the 
active and inactive groups in 0, so 0* 7 ^ 0 for any g ^ A and 0* = 0 for any g G I. The 
sets associated with the active and inactive groups are defined as 


Al = {0 G IR'^ : max || 0 g ||2 < 1 and || 0 g ||2 = 0 , 5 G /} 
g&G 

X = {0 G : max || 0 g ||2 < 1 and || 0 g ||2 = 0, g G ^4}. 

g&G 

Let M = spaniX)^ n Null{C) and Pm be the orthogonal projector onto the subspace M. 
The two assumptions are 

1. Restricted Strong Convexity. We assume that 

v'^V^£{e)v 


sup „ 

veM V 


> m 


(4.10.10) 


for all ||0 — 0*||2 < Since V^£(0) is lipschitz continuous, the existence of a constant 
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m that satisfies (4.10.10) is implied by the pointwise restricted convexity 

sup-- > m. 

veM V 

For convenience, we will use the former. 

2. Ir represent able condition. There exist r G (0,1) such that 


sup F(Pm^(v2£(0*)Pm(TmV2£(0*)Pm)+Pm^ - z)) < 1 - t, (4.10.11) 
zeA 

where V is the infimal convolution of the gauge pi, X pi{x) = inf{t : t > 0,tx G X}, 
and 1 [NuII{C)-^] : 


Viz) 


inf {pi{ui) + 1 NulliCy 
2 = 111+112 1 


(^ 2 )}. 


Restricted strong convexity is a standard assumption that ensures the parameter 0 is 
uniquely determined by the value of the likelihood function. Without this, there is no 
hope of accurately estimating 0*. It is only stated over a subspace M which can be much 
smaller than IR'^. The Ir represent able condition is a more stringent condition. Intuitively, 
it requires that the active parameter groups not be overly dependent on the inactive pa¬ 
rameter groups. Although the exact form of the condition is not enlightening, it is known 
to be necessary for model selection consistency in lasso-type problems (Zhao and Yu, 2006; 
Lee et ah, 2013b) and a common assumption in other works that establish model selection 
consistency (Ravikumar et ah, 2010; Jalali et ah, 2011; Peng et ah, 2009). We also define 
the constants that appear in the theorem: 

1. Lipschitz constants Li and L 2 . Let A(0) be the log-partition function. A(0) and 
^(0) are twice continuously differentiable functions, so their gradient and hessian are 
locally Lipschitz continuous in a ball of radius r around 0*: 


||VA(0i) - VA(02)||2 < Li ||0i - 02 II 2 , © 1 , ©2 G Br[Q*) 
XHiQi) - V2h(02)||2 < L 2 ||©1 - © 2 II 2 , ©1, ©2 G Pr-(©*) 
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2. Let r satisfy 


sup F(Pm^(v2 ^(0 *)Pm(PmV2 £(0 *)Pm)+Pm^ - z)) < r. 
zeAul 

y is a continuous function of z, so a finite r exists. 

Theorem 4.10.1. Suppose we are given samples from the mixed model with 

unknown parameters 0*. If we select 

(niaXgsGb|)log|G| 
n 

and the sample size n is larger than 

+ ^)2(niaXg6G \9\)\A\ log |G|, 

then, with probability at least 1 — 2( max^gc bl) exp(—cA^n), the optimal solution to (4.6.1) 
is unique and model selection consistent, 

^ ||0 _ 0*||2 < A (Ahi) ^/ 256Li|A|(max^gG |g|)log|G| 

2. Qg = 0, g e I and Og ^0 if ||0g||2 > (l + 

Remark 4.10.2. The same theorem applies to both the maximum likelihood and pseudolike¬ 
lihood estimators. For the maximum likelihood, the constants can be tightened; everywhere 
Li appears can be replaced by Li/128 and the theorem remains true. However, the values of 
T,T,m, Li, L 2 are different for the two methods. For the maximum likelihood, the gradient 
of the log-partition VA(0) and hessian of the log-likelihood V^f(0) do not depend on the 
samples. Thus the constants T,T,m, Li, L 2 are completely determined by 0* and the like¬ 
lihood. For the pseudolikelihood, the values of r, r, m, L 2 depend on the samples, and the 
theorem only applies if the assumptions are made on sample quantities; thus, the theorem is 
less useful in practice when applied to the pseudolikelihood. This is similar to the situation 
in Yang et al. (2013), where assumptions are made on sample quantities. 


2y/2mLiT y 
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